Abstract. When estimating causal effects using observational data, it
is desirable to replicate a randomized experiment as closely as possi- 
ble by obtaining treated and control groups with similar covariate dis- 
tributions. This goal can often be achieved by choosing well-matched 
samples of the original treated and control groups, thereby reducing 
bias due to the covariates. Since the 1970s, work on matching meth- 
ods has examined how to best choose treated and control subjects for 
comparison. Matching methods are gaining popularity in fields such as 
economics, epidemiology, medicine and political science. However, until 
now the literature and related advice have been scattered across disciplines.
Researchers who are interested in using matching methods, or in
developing methods related to matching, do not have a single place to
turn to learn about past and current research. This paper provides a 
structure for thinking about matching methods and guidance on their 
use, coalescing the existing research (both old and new) and providing 
a summary of where the literature on matching methods is now and 
where it should be headed. 

Key words and phrases: Observational study, propensity scores, sub- 
classification, weighting. 



1. INTRODUCTION 

One of the key benefits of randomized experiments 
for estimating causal effects is that the treated and 
control groups are guaranteed to be only randomly 
different from one another on all background co- 
variates, both observed and unobserved. Work on 
matching methods has examined how to replicate 
this as much as possible for observed covariates with 
observational (nonrandomized) data. Since early 
work in matching, which began in the 1940s, the 
methods have increased in both complexity and use. 
However, while the field is expanding, there has been
no single source of information for researchers in- 
terested in an overview of the methods and tech- 
niques available, nor a summary of advice for ap- 
plied researchers interested in implementing these 
methods. Instead, the research and resources
have been scattered across disciplines such as statis- 
tics (Rosenbaum, 2002; Rubin, 2006), epidemiology 
(Brookhart et al., 2006), sociology (Morgan and 
Harding, 2006), economics (Imbens, 2004) and political
science (Ho et al., 2007). This paper coalesces
the diverse literature on matching methods, bring- 
ing together the original work on matching methods— 
of which many current researchers are not aware — 
and tying together ideas across disciplines. In addi- 
tion to providing guidance on the use of matching 
methods, the paper provides a view of where re- 
search on matching methods should be headed. 

We define "matching" broadly to be any method 
that aims to equate (or "balance") the distribution 
of covariates in the treated and control groups. This 
may involve 1:1 matching, weighting or subclassification.
The use of matching methods is in the
broader context of the careful design of nonexperi- 
mental studies (Rosenbaum, 1999, 2002; Rubin, 
2007). While extensive time and effort is put into 
the careful design of randomized experiments, rela- 
tively little effort is put into the corresponding "de- 
sign" of nonexperimental studies. In fact, precisely 
because nonexperimental studies do not have the 
benefit of randomization, they require even more 
careful design. In this spirit of design, we can think 
of any study aiming to estimate the effect of some 
intervention as having two key stages: (1) design, 
and (2) outcome analysis. Stage (1) uses only back- 
ground information on the individuals in the study, 
designing the nonexperimental study as would be a 
randomized experiment, without access to the out- 
come values. Matching methods are a key tool for 
stage (1). Only after stage (1) is finished does stage 
(2) begin, comparing the outcomes of the treated 
and control individuals. While matching is generally 
used to estimate causal effects, it is also sometimes 
used for noncausal questions, for example, to inves- 
tigate racial disparities (Schneider, Zaslavsky and 
Epstein, 2004). 

Alternatives to matching methods include adjust- 
ing for background variables in a regression model, 
instrumental variables, structural equation model- 
ing or selection models. Matching methods have a 
few key advantages over those other approaches. First, 
matching methods should not be seen in conflict 
with regression adjustment and, in fact, the two 
methods are complementary and best used in com- 
bination. Second, matching methods highlight areas 
of the covariate distribution where there is not suf- 
ficient overlap between the treatment and control 
groups, such that the resulting treatment effect esti- 
mates would rely heavily on extrapolation. Selection 
models and regression models have been shown to 
perform poorly in situations where there is insuffi- 
cient overlap, but their standard diagnostics do not 
involve checking this overlap (Dehejia and Wahba, 
1999, 2002; Glazerman, Levy and Myers, 2003). 
Matching methods in part serve to make researchers 
aware of the quality of resulting inferences. Third, 
matching methods have straightforward diagnostics 
by which their performance can be assessed. 

The paper proceeds as follows. The remainder of 
Section 1 provides an introduction to matching methods
and the scenarios considered, including some of the history
and theory underlying matching methods.
Sections 2-5 provide details on each of the steps
involved in implementing matching: defining a dis- 
tance measure, doing the matching, diagnosing the 
matching, and then estimating the treatment effect 
after matching. The paper concludes with sugges- 
tions for future research and practical guidance in 
Section 6. 

1.1 Two Settings 

Matching methods are commonly used in two types 
of settings. The first is one in which the outcome 
values are not yet available and matching is used 
to select subjects for follow-up (e.g., Reinisch et al., 
1995; Stuart and Ialongo, 2009). It is particularly 
relevant for studies with cost considerations that 
prohibit the collection of outcome data for the full 
control group. This was the setting for most of the 
original work in matching methods, particularly the 
theoretical developments, which compared the ben- 
efits of selecting matched versus random samples 
of the control group (Althauser and Rubin, 1970; 
Rubin, 1973a, 1973b). The second setting is one in 
which all of the outcome data is already available, 
and the goal of the matching is to reduce bias in the 
estimation of the treatment effect. 

A common feature of matching methods, which 
is automatic in the first setting but not the sec- 
ond, is that the outcome values are not used in the 
matching process. Even if the outcome values are 
available at the time of the matching, the outcome 
values should not be used in the matching process. 
This precludes the selection of a matched sample 
that leads to a desired result, or even the appear- 
ance of doing so (Rubin, 2007). The matching can 
thus be done multiple times and the matched sam- 
ples with the best balance — the most similar treated 
and control groups — are chosen as the final matched 
samples; this is similar to the design of a random- 
ized experiment where a particular randomization 
may be rejected if it yields poor covariate balance 
(Hill, Rubin and Thomas, 1999; Greevy et al., 2004). 

This paper focuses on settings with a treatment 
defined at some particular point in time, covariates 
measured at (or relevant to) some period of time 
before the treatment, and outcomes measured af- 
ter the treatment. It does not consider more com- 
plex longitudinal settings where individuals may go 
in and out of the treatment group, or where treat- 
ment assignment date is undefined for the control 
group. Methods such as marginal structural models 
(Robins, Hernan and Brumback, 2000) or balanced
risk set matching (Li, Propert and Rosenbaum, 2001) 
are useful in those settings. 

1.2 Notation and Background: Estimating 
Causal Effects 

As first formalized in Rubin (1974), the estima- 
tion of causal effects, whether from a randomized 
experiment or a nonexperimental study, is inher- 
ently a comparison of potential outcomes. In par- 
ticular, the causal effect for individual i is the com- 
parison of individual i's outcome if individual i re- 
ceives the treatment (the potential outcome under 
treatment), $Y_i(1)$, and individual i's outcome if individual
i receives the control (the potential outcome
under control), $Y_i(0)$. For simplicity, we use
the term "individual" to refer to the units that re- 
ceive the treatment of interest, but the formulation 
would stay the same if the units were schools or 
communities. The "fundamental problem of causal 
inference" (Holland, 1986) is that, for each individ- 
ual, we can observe only one of these potential out- 
comes, because each unit (each individual at a par- 
ticular point in time) will receive either treatment 
or control, not both. The estimation of causal effects 
can thus be thought of as a missing data problem 
(Rubin, 1976a), where we are interested in predict- 
ing the unobserved potential outcomes. 

For efficient causal inference and good estimation 
of the unobserved potential outcomes, we would like 
to compare treated and control groups that are as 
similar as possible. If the groups are very different, 
the prediction of $Y(1)$ for the control group will
be made using information from individuals who
look very different from themselves, and likewise
for the prediction of $Y(0)$ for the treated group.
A number of authors, including Cochran and Rubin 
(1973), Rubin (1973a, 1973b), Rubin (1979), 
Heckman, Ichimura and Todd (1998), Rubin and 
Thomas (2000) and Rubin (2001), have shown that
methods such as linear regression adjustment can 
actually increase bias in the estimated treatment ef- 
fect when the true relationship between the covari- 
ate and outcome is even moderately nonlinear, espe- 
cially when there are large differences in the means 
and variances of the covariates in the treated and 
control groups. 

Randomized experiments use a known random- 
ized assignment mechanism to ensure "balance" of 
the covariates between the treated and control 
groups: the groups will be only randomly different
from one another on all covariates, observed and unobserved.
In nonexperimental studies, we must posit
an assignment mechanism, which determines which 
individuals receive treatment and which receive con- 
trol. A key assumption in nonexperimental studies 
is that of a strongly ignorable treatment assignment 
(Rosenbaum and Rubin, 1983b) which implies that 
(1) treatment assignment (T) is independent of the 
potential outcomes ($Y(0)$, $Y(1)$) given the covariates
($X$): $T \perp (Y(0), Y(1)) \mid X$, and (2) there is a positive
probability of receiving each treatment for all values
of $X$: $0 < P(T = 1 \mid X) < 1$ for all $X$. The first
component of the definition of strong ignorability 
is sometimes termed "ignorable," "no hidden bias" 
or "unconfounded." Weaker versions of the ignora- 
bility assumption are sufficient for some quantities 
of interest, as discussed further in Imbens (2004). 
This assumption is often more reasonable than it 
may sound at first since matching on or controlling 
for the observed covariates also matches on or controls
for the unobserved covariates, insofar as
they are correlated with those that are observed. 
Thus, the only unobserved covariates of concern are 
those unrelated to the observed covariates. Analy- 
ses can be done to assess sensitivity of the results to 
the existence of an unobserved confounder related 
to both treatment assignment and the outcome (see 
Section 6.1.2). Heller, Rosenbaum and Small (2009) 
also discuss how matching can make effect estimates 
less sensitive to an unobserved confounder, using a 
concept called "design sensitivity." An additional as- 
sumption is the Stable Unit Treatment Value 
Assumption (SUTVA; Rubin, 1980), which states 
that the outcomes of one individual are not affected 
by treatment assignment of any other individuals. 
While not always plausible — for example, in school 
settings where treatment and control children may 
interact, leading to "spillover" effects — the plausibil- 
ity of SUTVA can often be improved by design, such 
as by reducing interactions between the treated and 
control groups. Recent work has also begun think- 
ing about how to relax this assumption in analyses 
(Hong and Raudenbush, 2006; Sobel, 2006; 
Hudgens and Halloran, 2008). 

To formalize, using notation similar to that in 
Rubin (1976b), we consider two populations, $P_t$ and
$P_c$, where the subscript $t$ refers to a group exposed
to the treatment and $c$ refers to a group exposed
to the control. Covariate data on $p$ pre-treatment
covariates are available on random samples of sizes
$N_t$ and $N_c$ from $P_t$ and $P_c$. The mean vector and variance
covariance matrix of the $p$ covariates in group
$i$ are given by $\mu_i$ and $\Sigma_i$, respectively ($i = t, c$). For
individual $j$, the $p$ covariates are denoted by $X_j$,
treatment assignment by $T_j$ ($T_j = 0$ or $1$), and the
observed outcome by $Y_j$. Without loss of generality,
we assume $N_t < N_c$.

To define the treatment effect, let $E(Y(1) \mid X) = R_1(X)$
and $E(Y(0) \mid X) = R_0(X)$. In the matching
context effects are usually defined as the difference
in potential outcomes, $\tau(x) = R_1(x) - R_0(x)$,
although other quantities, such as odds ratios, are
also sometimes of interest. It is often assumed that
the response surfaces, $R_0(x)$ and $R_1(x)$, are parallel,
so that $\tau(x) = \tau$ for all $x$. If the response surfaces
are not parallel (i.e., the effect varies), an average 
effect over some population is generally estimated. 
Variation in effects is particularly relevant when the 
estimands of interest are not difference in means, 
but rather odds ratios or relative risks, for which 
the conditional and marginal effects are not neces- 
sarily equal (Austin, 2007; Lunt et al., 2009). The 
most common estimands in nonexperimental stud- 
ies are the "average effect of the treatment on the 
treated" (ATT), which is the effect for those in the 
treatment group, and the "average treatment effect" 
(ATE), which is the effect on all individuals (treat- 
ment and control). See Imbens (2004), Kurth et al. 
(2006) and Imai, King and Stuart (2008) for further 
discussion of these distinctions. The choice between 
these estimands will likely involve both substantive 
reasons and data availability, as further discussed in 
Section 6.2. 
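
To make the distinction between the two estimands concrete, the following minimal sketch computes both from a small hypothetical table in which both potential outcomes are listed for every individual. Such a table is never observed in practice, and the data values and the function names `ate` and `att` are illustrative assumptions rather than part of the methods reviewed here.

```python
# Hypothetical data: (T, Y0, Y1) for each individual. In real data only one
# of Y0 and Y1 is observed; both appear here purely for illustration.
units = [
    (1, 2.0, 5.0),
    (1, 3.0, 7.0),
    (0, 1.0, 2.0),
    (0, 4.0, 5.0),
    (0, 2.0, 3.0),
]


def ate(units):
    """Average treatment effect: mean of Y1 - Y0 over all individuals."""
    effects = [y1 - y0 for _, y0, y1 in units]
    return sum(effects) / len(effects)


def att(units):
    """Average effect on the treated: mean of Y1 - Y0 over those with T = 1."""
    effects = [y1 - y0 for t, y0, y1 in units if t == 1]
    return sum(effects) / len(effects)


print("ATE:", ate(units))
print("ATT:", att(units))
```

In this made-up example the individual effects are larger among the treated, so the ATT exceeds the ATE; when the response surfaces are parallel the two coincide.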

1.3 History and Theoretical Development of 
Matching Methods 

Matching methods have been in use since the first 
half of the 20th Century (e.g., Greenwood, 1945; 
Chapin, 1947); however, a theoretical basis for these
methods was not developed until the 1970s. This de- 
velopment began with papers by Cochran and Rubin 
(1973) and Rubin (1973a, 1973b) for situations with 
one covariate and an implicit focus on estimating 
the ATT. Althauser and Rubin (1970) provide an 
early and excellent discussion of some practical is- 
sues associated with matching: how large the control 
"reservoir" should be to get good matches, how to 
define the quality of matches, and how to define a "close-enough"
match. Many of the issues identified in that
work are topics of continuing debate and discussion. 
The early papers showed that when estimating the
ATT, better matching scenarios include situations
with many more control than treated individuals, 
small initial bias between the groups, and smaller 
variance in the treatment group than the control 
group. 

Dealing with multiple covariates was a challenge 
due to both computational and data problems. With 
more than just a few covariates, it becomes very dif- 
ficult to find matches with close or exact values of 
all covariates. For example, Chapin (1947) finds that 
with initial pools of 671 treated and 523 controls 
there are only 23 pairs that match exactly on six 
categorical covariates. An important advance was 
made in 1983 with the introduction of the propen- 
sity score, defined as the probability of receiving the 
treatment given the observed covariates 
(Rosenbaum and Rubin, 1983b). The propensity 
score facilitates the construction of matched sets 
with similar distributions of the covariates, with- 
out requiring close or exact matches on all of the 
individual variables. 

In a series of papers in the 1990s, Rubin and Thomas 
(1992a, 1992b, 1996) provided a theoretical basis for 
multivariate settings with affinely invariant match- 
ing methods and ellipsoidally symmetric covariate 
distributions (such as the normal or t-distribution),
again focusing on estimating the ATT. Affinely in- 
variant matching methods, such as propensity score 
or Mahalanobis metric matching, are those that yield 
the same matches following an affine (linear) trans- 
formation of the data. Matching in this general set- 
ting is shown to be Equal Percent Bias Reducing 
(EPBR; Rubin, 1976b). Rubin and Stuart (2006) later 
showed that the EPBR feature also holds under much 
more general settings, in which the covariate dis- 
tributions are discriminant mixtures of ellipsoidally 
symmetric distributions. EPBR methods reduce bias 
in all covariate directions (i.e., make the covariate
means closer) by the same amount, ensuring that if 
close matches are obtained in some direction (such 
as the propensity score), then the matching is also 
reducing bias in all other directions. The matching 
thus cannot be increasing bias in an outcome that is 
a linear combination of the covariates. In addition, 
matching yields the same percent reduction in
bias for any linear function of X if and only if the
matching is EPBR. 

Rubin and Thomas (1992b) and Rubin and Thomas 
(1996) obtain analytic approximations for the reduc- 
tion in bias on an arbitrary linear combination of the 
covariates (e.g., the outcome) that can be obtained 
when matching on the true or estimated discriminant
(or propensity score) with normally distributed
covariates. In fact, the approximations hold remark- 
ably well even when the distributional assumptions 
are not satisfied (Rubin and Thomas, 1996). The 
approximations in Rubin and Thomas (1996) can 
be used to determine in advance the bias reduc- 
tion that will be possible from matching, based on 
the covariate distributions in the treated and con- 
trol groups, the size of the initial difference in the 
covariates between the groups, the original sample 
sizes, the number of matches desired and the cor- 
relation between the covariates and the outcome. 
Unfortunately these approximations are rarely used 
in practice, despite their ability to help researchers 
quickly assess whether their data will be useful for 
estimating the causal effect of interest. 

1.4 Steps in Implementing Matching Methods 

Matching methods have four key steps, with the 
first three representing the "design" and the fourth 
the "analysis": 

1. Defining "closeness": the distance measure used 
to determine whether an individual is a good 
match for another. 

2. Implementing a matching method, given that mea- 
sure of closeness. 

3. Assessing the quality of the resulting matched 
samples, and perhaps iterating with steps 1 and 
2 until well-matched samples result. 

4. Analysis of the outcome and estimation of the 
treatment effect, given the matching done in step 
3. 

The next four sections go through these steps one 
at a time, providing an overview of approaches and 
advice on the most appropriate methods. 

2. DEFINING CLOSENESS 

There are two main aspects to determining the 
measure of distance (or "closeness" ) to use in match- 
ing. The first involves which covariates to include, 
and the second involves combining those covariates 
into one distance measure. 

2.1 Variables to Include 

The key concept in determining which covariates 
to include in the matching process is that of strong 
ignorability. As discussed above, matching methods, 
and in fact most nonexperimental study methods,
rely on ignorability, which assumes that there are no
unobserved differences between the treatment and 
control groups, conditional on the observed covari- 
ates. To satisfy the assumption of ignorable treat- 
ment assignment, it is important to include in the 
matching procedure all variables known to be re- 
lated to both treatment assignment and the outcome 
(Rubin and Thomas, 1996; Heckman, Ichimura and 
Todd, 1998; Glazerman, Levy and Myers, 2003; 
Hill, Reiter and Zanutto, 2004). Methods that use a relatively
small set of "predictors of convenience," such as demographics
only, generally perform poorly (Shadish, Clark and Steiner, 2008).
When matching using propensity scores, detailed 
below, there is little cost to including variables that 
are actually unassociated with treatment assignment, 
as they will be of little influence in the propensity 
score model. Including variables that are actually 
unassociated with the outcome can yield slight in- 
creases in variance. However, excluding a potentially 
important confounder can be very costly in terms of 
increased bias. Researchers should thus be liberal in 
terms of including variables that may be associated 
with treatment assignment and/or the outcomes. 
Some examples of matching have 50 or even 100 
covariates included in the procedure (e.g., Rubin, 
2001). However, in small samples it may not be pos- 
sible to include a very large set of variables. In that 
case priority should be given to variables believed to 
be related to the outcome, as there is a higher cost 
in terms of increased variance of including variables 
unrelated to the outcome but highly related to treat- 
ment assignment (Brookhart et al., 2006). Another 
effective strategy is to include a small set of covari- 
ates known to be related to the outcomes of interest, 
do the matching, and then check the balance on all 
of the available covariates, including any additional 
variables that remain particularly unbalanced after 
the matching. To avoid allegations of variable se- 
lection based on estimated effects, it is best if the 
variable selection process is done without using the 
observed outcomes, and instead is based on previous 
research and scientific understanding (Rubin, 2001). 

One type of variable that should not be included 
in the matching process is any variable that may 
have been affected by the treatment of interest 
(Rosenbaum, 1984; Frangakis and Rubin, 2002; 
Greenland, 2003). This is especially important when 
the covariates, treatment indicator and outcomes 
are all collected at the same point in time. If it is 
deemed to be critical to control for a variable potentially
affected by treatment assignment, it is better
to exclude that variable in the matching procedure 
and include it in the analysis model for the outcome 
(as in Reinisch et al., 1995). 1 

Another challenge that potentially arises is when 
variables are fully (or nearly fully) predictive of treat- 
ment assignment. Excluding such a variable 
should be done only with great care, with the belief 
that the problematic variable is completely unasso- 
ciated with the outcomes of interest and that the 
ignorability assumption will still hold. More com- 
monly, such a variable indicates a fundamental prob- 
lem in estimating the effect of interest, whereby it 
may not be possible to separate out the effect of the 
treatment of interest from this problematic variable 
using the data at hand. For example, if all adolescent 
heavy drug users are also heavy drinkers, it will be 
impossible to separate out the effect of heavy drug 
use from the effect of heavy drinking. 

2.2 Distance Measures 

The next step is to define the "distance": a mea- 
sure of the similarity between two individuals. There 
are four primary ways to define the distance $D_{ij}$ between
individuals $i$ and $j$ for matching, all of which
are affinely invariant:

1. Exact: 

$$D_{ij} = \begin{cases} 0, & \text{if } X_i = X_j, \\ \infty, & \text{if } X_i \neq X_j. \end{cases}$$

2. Mahalanobis: 

$$D_{ij} = (X_i - X_j)' \Sigma^{-1} (X_i - X_j).$$

If interest is in the ATT, $\Sigma$ is the variance covariance
matrix of $X$ in the full control group;
if interest is in the ATE, then $\Sigma$ is the variance
covariance matrix of X in the pooled treatment 
and full control groups. If X contains categorical 
variables, they should be converted to a series 
of binary indicators, although the distance works 
best with continuous variables. 

3. Propensity score: 

$$D_{ij} = |e_i - e_j|,$$

where $e_k$ is the propensity score for individual $k$,
defined in detail below. 

1 The method is misstated in the footnote in Table 1 of 
that paper. In fact, the potential confounding variables were 
not used in the matching procedure, but were utilized in the 
outcome analysis (D. B. Rubin, personal communication). 



4. Linear propensity score: 

$$D_{ij} = |\operatorname{logit}(e_i) - \operatorname{logit}(e_j)|.$$

Rosenbaum and Rubin (1985b), Rubin and Thomas 
(1996) and Rubin (2001) have found that match- 
ing on the linear propensity score can be partic- 
ularly effective in terms of reducing bias. 

Below we use "propensity score" to refer to either 
the propensity score itself or the linear version. 
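
The four distance measures listed above can be sketched in a few lines of code. The following is a minimal illustration using only the Python standard library, not a reference implementation: the function names, the small Gaussian-elimination helper `solve` and the example data are all assumptions introduced here. The Mahalanobis function returns the quadratic form exactly as written in item 2, without a square root.

```python
import math


def exact_distance(xi, xj):
    """0 if the covariate vectors are identical, infinity otherwise."""
    return 0.0 if list(xi) == list(xj) else math.inf


def covariance_matrix(rows):
    """Sample variance covariance matrix (divisor n - 1) of a list of rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(p)]
    return [[sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / (n - 1)
             for b in range(p)] for a in range(p)]


def solve(a, b):
    """Solve a x = b by Gaussian elimination (assumes a is nonsingular)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x


def mahalanobis_distance(xi, xj, sigma):
    """(xi - xj)' sigma^{-1} (xi - xj), computed without inverting sigma."""
    d = [a - b for a, b in zip(xi, xj)]
    z = solve(sigma, d)
    return sum(a * b for a, b in zip(d, z))


def propensity_distance(ei, ej):
    """Absolute difference in propensity scores."""
    return abs(ei - ej)


def logit(p):
    return math.log(p / (1.0 - p))


def linear_propensity_distance(ei, ej):
    """Absolute difference on the logit (linear propensity score) scale."""
    return abs(logit(ei) - logit(ej))


# Hypothetical control-group covariates, used to estimate sigma (ATT case).
controls = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
sigma = covariance_matrix(controls)
print(mahalanobis_distance((1.0, 1.0), (2.0, 0.0), sigma))
print(linear_propensity_distance(0.8, 0.5))
```

Estimating the covariance matrix from the control group alone follows the ATT convention in item 2; for the ATE the pooled sample would be passed to `covariance_matrix` instead.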

Although exact matching is in many ways the 
ideal (Imai, King and Stuart, 2008), the primary dif- 
ficulty with the exact and Mahalanobis distance mea- 
sures is that neither works very well when X is high 
dimensional. Requiring exact matches often leads to 
many individuals not being matched, which can re- 
sult in larger bias than if the matches are inexact but 
more individuals remain in the analysis 
(Rosenbaum and Rubin, 1985b). A recent advance, 
coarsened exact matching (CEM), can be used to do 
exact matching on broader ranges of the variables; 
for example, using income categories rather than a 
continuous measure (Iacus, King and Porro, 2009). 
The Mahalanobis distance can work quite well when 
there are relatively few covariates (fewer than 8; 
Rubin, 1979; Zhao, 2004), but it does not perform 
as well when the covariates are not normally dis- 
tributed or there are many covariates 
(Gu and Rosenbaum, 1993). This is likely because 
Mahalanobis metric matching essentially regards all 
interactions among the elements of X as equally im- 
portant; with more covariates, Mahalanobis match- 
ing thus tries to match more and more of these 
multi-way interactions. 
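
The coarsening idea behind CEM, mentioned above, can be illustrated with a minimal sketch: a continuous covariate is replaced by a bin index, and exact matching is then carried out on the bins. The income cutpoints, the unit identifiers and the function names below are hypothetical, and the sketch covers a single covariate only; it is not the procedure of Iacus, King and Porro (2009) in full.

```python
from bisect import bisect_right
from collections import defaultdict


def coarsen(value, cutpoints):
    """Index of the bin containing value, given sorted cutpoints."""
    return bisect_right(cutpoints, value)


def coarsened_exact_match(units, cutpoints):
    """Group units by coarsened income and keep strata with both groups.

    units is a list of (identifier, treated, income) tuples.
    """
    strata = defaultdict(lambda: {"treated": [], "control": []})
    for uid, treated, income in units:
        key = coarsen(income, cutpoints)
        strata[key]["treated" if treated else "control"].append(uid)
    # Strata lacking either treated or control individuals are discarded.
    return {k: s for k, s in strata.items() if s["treated"] and s["control"]}


# Hypothetical individuals and income categories.
units = [("a", 1, 12000), ("b", 0, 18000), ("c", 1, 45000), ("d", 0, 90000)]
matched = coarsened_exact_match(units, cutpoints=[20000, 50000])
print(matched)
```

Here individuals "a" and "b" fall in the same income category and are retained, while "c" and "d" have no counterpart in their categories and are dropped.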

A major advance was made in 1983 with the intro- 
duction of propensity scores (Rosenbaum and Rubin, 
1983b). Propensity scores summarize all of the co- 
variates into one scalar: the probability of being 
treated. The propensity score for individual i is de- 
fined as the probability of receiving the treatment 
given the observed covariates: $e_i(X_i) = P(T_i = 1 \mid X_i)$.
There are two key properties of propensity scores. 
The first is that propensity scores are balancing scores: 
At each value of the propensity score, the distri- 
bution of the covariates X defining the propensity 
score is the same in the treated and control groups. 
Thus, grouping individuals with similar propensity 
scores replicates a mini-randomized experiment, at 
least with respect to the observed covariates. Sec- 
ond, if treatment assignment is ignorable given the 
covariates, then treatment assignment is also ignorable
given the propensity score. This justifies matching
based on the propensity score rather than on
the full multivariate set of covariates. Thus, when 
treatment assignment is ignorable, the difference in 
means in the outcome between treated and con- 
trol individuals with a particular propensity score 
value is an unbiased estimate of the treatment effect 
at that propensity score value. While most of the 
propensity score results are in the context of finite 
samples and the settings considered by 
Rubin and Thomas (1992a, 1996), Abadie and Im- 
bens (2009a) discuss the asymptotic properties of 
propensity score matching. 

The distance measures described above can also 
be combined, for example, doing exact matching 
on key covariates such as race or gender followed 
by propensity score matching within those groups. 
When exact matching on even a few variables is not 
possible because of sample size limitations, meth- 
ods that yield "fine balance" (e.g., the same pro- 
portion of African American males in the matched 
treated and control groups) may be a good alterna- 
tive (Rosenbaum, Ross and Silber, 2007). If the key 
covariates of interest are continuous, Mahalanobis 
matching within propensity score calipers 
(Rubin and Thomas, 2000) defines the distance be- 
tween individuals i and j as 

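$$D_{ij} = \begin{cases} (Z_i - Z_j)' \Sigma^{-1} (Z_i - Z_j), & \text{if } |\operatorname{logit}(e_i) - \operatorname{logit}(e_j)| \le c, \\ \infty, & \text{if } |\operatorname{logit}(e_i) - \operatorname{logit}(e_j)| > c, \end{cases}$$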

where $c$ is the caliper, $Z$ is the set of "key covariates,"
and $\Sigma$ is the variance covariance matrix of
Z. This will yield matches that are relatively well 
matched on the propensity score and particularly 
well matched on Z. Z often consists of pre-treatment 
measures of the outcome, such as baseline test scores 
in educational evaluations. Rosenbaum and Rubin 
(1985b) discuss the choice of caliper size, generaliz- 
ing results from Table 2.3.1 of Cochran and Rubin 
(1973). When the variance of the linear propensity 
score in the treatment group is twice as large as that 
in the control group, a caliper of 0.2 standard de- 
viations removes 98% of the bias in a normally dis- 
tributed covariate. If the variance in the treatment 
group is much larger than that in the control group, 
smaller calipers are necessary. Rosenbaum and Rubin 
(1985b) generally suggest a caliper of 0.25 standard 
deviations of the linear propensity score. 
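
The caliper distance defined above can be sketched as follows. This is a minimal illustration under stated assumptions: the inverse covariance matrix of Z (`sigma_inv`) is taken as already computed, the caliper is set to 0.25 standard deviations of the linear propensity score, and all function names and data values are hypothetical.

```python
import math
import statistics


def logit(p):
    return math.log(p / (1.0 - p))


def caliper_width(scores, multiplier=0.25):
    """Caliper as a multiple of the SD of the linear propensity score."""
    return multiplier * statistics.stdev([logit(e) for e in scores])


def quadratic_form(d, m):
    """d' m d for a vector d and a square matrix m."""
    n = len(d)
    return sum(d[a] * m[a][b] * d[b] for a in range(n) for b in range(n))


def caliper_mahalanobis(zi, zj, ei, ej, sigma_inv, c):
    """Mahalanobis distance on Z inside the caliper, infinity outside it."""
    if abs(logit(ei) - logit(ej)) > c:
        return math.inf
    d = [a - b for a, b in zip(zi, zj)]
    return quadratic_form(d, sigma_inv)


# Hypothetical estimated propensity scores and key covariates.
scores = [0.2, 0.4, 0.5, 0.7]
c = caliper_width(scores)
identity = [[1.0, 0.0], [0.0, 1.0]]
print(caliper_mahalanobis((1.0, 0.0), (0.0, 0.0), 0.50, 0.52, identity, c))
print(caliper_mahalanobis((1.0, 0.0), (0.0, 0.0), 0.20, 0.70, identity, c))
```

Pairs outside the caliper receive an infinite distance, so any matching algorithm that minimizes distance will never pair them, however close they are on Z.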



A more recently developed distance measure is the 
"prognosis score" (Hansen, 2008). Prognosis scores 
are essentially the predicted outcome each individ- 
ual would have under the control condition. The 
benefit of prognosis scores is that they take into ac- 
count the relationship between the covariates and 
the outcome; the drawback is that they require a model
for that relationship. Because this approach does not have the
clear separation of the design and analysis stages
that we advocate here, we focus instead on other approaches,
but it is a potentially important advance
in the matching literature. 

2.2.1 Propensity score estimation and model specification.
In practice, the true propensity scores are
rarely known outside of randomized experiments and 
thus must be estimated. Any model relating a bi- 
nary variable to a set of predictors can be used. 
The most common choice for propensity score estimation
is logistic regression, although nonparametric meth- 
ods such as boosted CART and generalized boosted 
models (gbm) often show very good performance 
(McCaffrey, Ridgeway and Morral, 2004; 
Setoguchi et al., 2008; Lee, Lessler and Stuart, 2009). 
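
As an illustration of the logistic regression approach, the following minimal sketch fits the model by plain gradient ascent on the log-likelihood and returns the estimated propensity scores. It is an assumption-laden toy, not a recommended implementation: the data are hypothetical, the learning rate and number of steps were chosen by hand for this small example, and in practice a standard statistical package would be used.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def linear_predictor(beta, row):
    """Intercept beta[0] plus the weighted covariates."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], row))


def fit_logistic(X, t, lr=0.1, steps=5000):
    """Maximize the mean Bernoulli log-likelihood by gradient ascent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * (p + 1)
    for _ in range(steps):
        grad = [0.0] * (p + 1)
        for row, ti in zip(X, t):
            resid = ti - sigmoid(linear_predictor(beta, row))
            grad[0] += resid
            for k, x in enumerate(row):
                grad[k + 1] += resid * x
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


def propensity_scores(X, beta):
    """Estimated P(T = 1 | X) for each row of X."""
    return [sigmoid(linear_predictor(beta, row)) for row in X]


# Hypothetical data: one covariate, treatment more likely at larger values.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0]]
t = [0, 0, 1, 0, 0, 1, 1, 1]
beta = fit_logistic(X, t)
scores = propensity_scores(X, beta)
print([round(e, 3) for e in scores])
```

The fitted scores, or their logits, can then be passed to whichever distance measure from Section 2.2 is being used.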

The model diagnostics when estimating propen- 
sity scores are not the standard model diagnostics 
for logistic regression or CART. With propensity 
score estimation, concern is not with the parameter 
estimates of the model, but rather with the resulting 
balance of the covariates (Augurzky and Schmidt, 
2001). Because of this, standard concerns about 
collinearity do not apply. Similarly, since they do not 
use covariate balance as a criterion, model fit statis- 
tics identifying classification ability (such as the c- 
statistic) or stepwise selection models are not helpful 
for variable selection (Rubin, 2004; Brookhart et al., 
2006; Setoguchi et al., 2008). One strategy that is 
helpful is to examine the balance of covariates (in- 
cluding those not originally included in the propen- 
sity score model), their squares and interactions in 
the matched samples. If imbalance is found on par- 
ticular variables or functions of variables, those terms 
can be included in a re-estimated propensity score 
model, which should improve their balance in the 
subsequent matched samples (Rosenbaum and Rubin, 
1984; Dehejia and Wahba, 2002). 
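To make the estimation step concrete, here is a minimal sketch of a logistic regression propensity score model fit by gradient ascent, using only the Python standard library. The data, the function names `fit_logistic` and `propensity_scores`, and the step size and iteration count are all hypothetical choices for illustration; an applied analysis would normally rely on an established statistical package.

```python
import math

def fit_logistic(X, T, lr=0.1, n_iter=5000):
    """Fit P(T=1 | x) = sigmoid(b0 + b.x) by gradient ascent on the
    mean log-likelihood. X is a list of covariate rows, T a list of 0/1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * (p + 1)  # beta[0] is the intercept
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for row, t in zip(X, T):
            z = beta[0] + sum(b * x for b, x in zip(beta[1:], row))
            resid = t - 1.0 / (1.0 + math.exp(-z))
            grad[0] += resid
            for k, x in enumerate(row):
                grad[k + 1] += resid * x
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

def propensity_scores(X, beta):
    """Estimated propensity score for each covariate row."""
    out = []
    for row in X:
        z = beta[0] + sum(b * x for b, x in zip(beta[1:], row))
        out.append(1.0 / (1.0 + math.exp(-z)))
    return out

# Hypothetical data: one covariate; treated units tend to have larger values.
X = [[0.1], [0.4], [0.5], [0.9], [1.2], [1.5], [1.8], [2.0]]
T = [0, 0, 1, 0, 1, 0, 1, 1]
beta_hat = fit_logistic(X, T)
e_hat = propensity_scores(X, beta_hat)
```

Consistent with the discussion above, the fitted coefficients are not themselves of interest. The scores in `e_hat` would be judged by the covariate balance they produce after matching, and the model re-estimated with added terms if imbalance remains.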

Research indicates that misestimation of the propen- 
sity score (e.g., excluding a squared term that is in 
the true model) is not a large problem, and that 
treatment effect estimates are more biased when the 
outcome model is misspecified than when the propen- 
sity score model is misspecified (Drake, 1993; 
Dehejia and Wahba, 1999, 2002; Zhao, 2004). This
may in part be because the propensity score is used 
only as a tool to get covariate balance — the accuracy 
of the model is less important as long as balance is 
obtained. Thus, the exclusion of a squared term, for 
example, may have less severe consequences for a 
propensity score model than it does for the outcome 
model, where interest is in interpreting a particular 
regression coefficient (that on the treatment indica- 
tor). However, these evaluations are fairly limited; 
for example, Drake (1993) considers only two covari- 
ates. Future research should involve more systematic 
evaluations of propensity score estimation, perhaps 
through more sophisticated simulations as well as 
analytic work, and consideration should include how 
the propensity scores will be used, for example, in 
weighting versus subclassification. 

3. MATCHING METHODS 

Once a distance measure has been selected, the 
next step is to use that distance in doing the match- 
ing. In this section we provide an overview of the 
spectrum of matching methods available. The meth- 
ods primarily vary in terms of the number of individ- 
uals that remain after matching and in the relative 
weights that different individuals receive. One way 
in which propensity scores are commonly used is as 
a predictor in the outcome model, where the set of 
individual covariates is replaced by the propensity 
score and the outcome models run in the full treated 
and control groups (Weitzen et al., 2004). Unfortu- 
nately the simple use of this method is not an op- 
timal use of propensity scores, as it does not take 
advantage of the balancing property of propensity 
scores: If there is imbalance on the original covari- 
ates, there will also be imbalance on the propensity 
score, resulting in the same degree of model extrap- 
olation as with the full set of covariates. However, 
if the model regressing the outcome on the treat- 
ment indicator and the propensity score is correctly 
specified or if it includes nonlinear functions of the 
propensity score (such as quantiles or splines) and 
their interaction with the treatment indicator, then 
this can be an effective approach, with links to sub- 
classification (Schafer and Kang, 2008). Since this 
method does not have the clear "design" aspect of 
matching, we do not discuss it further. 

3.1 Nearest Neighbor Matching 

One of the most common, and easiest to imple- 
ment and understand, methods is k : 1 nearest neigh- 
bor matching (Rubin, 1973a). This is generally the
most effective method for settings where the goal 
is to select individuals for follow-up. Nearest neigh- 
bor matching nearly always estimates the ATT, as it 
matches control individuals to the treated group and 
discards controls who are not selected as matches. 

In its simplest form, 1 : 1 nearest neighbor match- 
ing selects for each treated individual i the control 
individual with the smallest distance from individ- 
ual i. A common complaint regarding 1 : 1 matching 
is that it can discard a large number of observations 
and thus would apparently lead to reduced power. 
However, the reduction in power is often minimal, 
for two main reasons. First, in a two-sample com- 
parison of means, the precision is largely driven by 
the smaller group size (Cohen, 1988). So if the treat- 
ment group stays the same size, and only the con- 
trol group decreases in size, the overall power may 
not actually be reduced very much (Ho et al., 2007). 
Second, the power increases when the groups are 
more similar because of the reduced extrapolation 
and higher precision that is obtained when com- 
paring groups that are similar versus groups that 
are quite different (Snedecor and Cochran, 1980). 
This is also what yields the increased power of using 
matched pairs in randomized experiments 
(Wacholder and Weinberg, 1982). Smith (1997) pro- 
vides an illustration where estimates from 1 : 1 match- 
ing have lower standard deviations than estimates 
from a linear regression, even though thousands of 
observations were discarded in the matching. An ad- 
ditional concern is that, without any restrictions, 
k : 1 matching can lead to some poor matches, if, 
for example, there are no control individuals with 
propensity scores similar to a given treated individ- 
ual. One strategy to avoid poor matches is to im- 
pose a caliper and only select a match if it is within 
the caliper. This can lead to difficulties in interpret- 
ing effects if many treated individuals do not re- 
ceive a match, but can help avoid poor matches. 
Rosenbaum and Rubin (1985a) discuss those trade- 
offs. 
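The 1:1 nearest neighbor procedure with a caliper can be sketched as follows. This is an illustrative greedy implementation without replacement, with hypothetical unit identifiers and scores; treated individuals are processed in the order supplied, and a treated individual whose nearest available control lies outside the caliper is left unmatched.

```python
def greedy_match(treated, controls, caliper=None):
    """Greedy 1:1 nearest neighbor matching without replacement.

    treated, controls: dicts mapping unit id -> distance score
    (for example, a linear propensity score).
    Returns a dict treated id -> matched control id; treated units with
    no available control within the caliper are left unmatched.
    """
    available = dict(controls)
    matches = {}
    for t_id, t_score in treated.items():  # matching order = dict order
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if caliper is None or abs(available[c_id] - t_score) <= caliper:
            matches[t_id] = c_id
            del available[c_id]
    return matches

# Hypothetical scores.
treated = {"t1": 0.30, "t2": 0.55, "t3": 0.95}
controls = {"c1": 0.28, "c2": 0.50, "c3": 0.70, "c4": 0.10}
pairs = greedy_match(treated, controls, caliper=0.1)
```

In this example the third treated individual has no control within the caliper and so receives no match, which illustrates the interpretive trade-off noted above.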

3.1.1 Optimal matching One complication of simple
("greedy") nearest neighbor matching is that the
order in which the treated subjects are matched may 
change the quality of the matches. Optimal match- 
ing avoids this issue by taking into account the over- 
all set of matches when choosing individual matches, 
minimizing a global distance measure (Rosenbaum, 
2002). Generally, greedy matching performs poorly 
when there is intense competition for controls, and 
performs well when there is little competition
(Gu and Rosenbaum, 1993). Gu and Rosenbaum 
(1993) find that optimal matching does not in gen- 
eral perform any better than greedy matching in 
terms of creating groups with good balance, but 
does do better at reducing the distance within pairs 
(page 413): "...optimal matching picks about the 
same controls [as greedy matching] but does a bet- 
ter job of assigning them to treated units." Thus, 
if the goal is simply to find well-matched groups, 
greedy matching may be sufficient. However, if the 
goal is well-matched pairs, then optimal matching 
may be preferable. 
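The difference between greedy and optimal matching can be seen in a toy example. The sketch below finds the optimal assignment by brute-force enumeration, which is feasible only for very small problems and is a stand-in for the network flow algorithms used in practice; the scores are hypothetical.

```python
from itertools import permutations

def total_distance(treated, controls, assignment):
    """Sum of within-pair distances for a given assignment."""
    return sum(abs(t - controls[j]) for t, j in zip(treated, assignment))

def greedy_assignment(treated, controls):
    """Match treated units in the given order to the nearest unused control."""
    unused = list(range(len(controls)))
    assignment = []
    for t in treated:
        j = min(unused, key=lambda k: abs(t - controls[k]))
        assignment.append(j)
        unused.remove(j)
    return tuple(assignment)

def optimal_assignment(treated, controls):
    """Brute-force search over all 1:1 assignments (tiny examples only)."""
    candidates = permutations(range(len(controls)), len(treated))
    return min(candidates, key=lambda a: total_distance(treated, controls, a))

# Hypothetical scores for two treated and two control individuals.
treated = [0.40, 0.50]
controls = [0.45, 0.20]
greedy = greedy_assignment(treated, controls)
optimal = optimal_assignment(treated, controls)
```

Here both methods select the same two controls, but the optimal assignment pairs them differently and achieves a smaller total within-pair distance, in line with the Gu and Rosenbaum (1993) observation quoted above.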

3.1.2 Selecting the number of matches: Ratio 
matching When there are large numbers of control 
individuals, it is sometimes possible to get multi- 
ple good matches for each treated individual, called 
ratio matching (Smith, 1997; Rubin and Thomas, 
2000). Selecting the number of matches involves a 
bias:variance trade-off. Selecting multiple controls
for each treated individual will generally increase 
bias since the 2nd, 3rd and 4th closest matches are, 
by definition, further away from the treated indi- 
vidual than is the 1st closest match. On the other 
hand, utilizing multiple matches can decrease vari- 
ance due to the larger matched sample size. Approx- 
imations in Rubin and Thomas (1996) can help de- 
termine the best ratio. In settings where the out- 
come data has yet to be collected and there are cost 
constraints, researchers must also balance cost con- 
siderations. More methodological work needs to be 
done to more formally quantify the trade-offs in- 
volved. In addition, k : 1 matching is not optimal 
since it does not account for the fact that some 
treated individuals may have many close matches 
while others have very few. A more advanced form of 
ratio matching, variable ratio matching, allows the 
ratio to vary, with different treated individuals re- 
ceiving differing numbers of matches 
(Ming and Rosenbaum, 2001). Variable ratio match- 
ing is related to full matching, described below. 

3.1.3 With or without replacement Another key 
issue is whether controls can be used as matches 
for more than one treated individual: whether the 
matching should be done "with replacement" or 
"without replacement." Matching with replacement 
can often decrease bias because controls that look 
similar to many treated individuals can be used mul- 
tiple times. This is particularly helpful in settings 
where there are few control individuals comparable
to the treated individuals (e.g., Dehejia and Wahba, 
1999). Additionally, when matching with replace- 
ment, the order in which the treated individuals are 
matched does not matter. However, inference be- 
comes more complex when matching with replace- 
ment, because the matched controls are no longer 
independent — some are in the matched sample more 
than once and this needs to be accounted for in the 
outcome analysis, for example, by using frequency 
weights. When matching with replacement, it is also 
possible that the treatment effect estimate will be 
based on just a small number of controls; the num- 
ber of times each control is matched should be mon- 
itored. 
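A minimal sketch of matching with replacement, including the monitoring of how often each control is used, might look as follows. The identifiers and scores are hypothetical.

```python
from collections import Counter

def match_with_replacement(treated, controls):
    """1:1 nearest neighbor matching with replacement.

    Returns (matches, counts): matches maps each treated id to its nearest
    control id; counts gives how often each control was used, which can
    serve as a frequency weight in the outcome analysis.
    """
    matches = {
        t_id: min(controls, key=lambda c: abs(controls[c] - t_score))
        for t_id, t_score in treated.items()
    }
    return matches, Counter(matches.values())

# Hypothetical scores.
treated = {"t1": 0.70, "t2": 0.74, "t3": 0.40}
controls = {"c1": 0.72, "c2": 0.35, "c3": 0.10}
matches, counts = match_with_replacement(treated, controls)
```

Because every treated individual is compared against the full set of controls, the result does not depend on the matching order. The `counts` object shows at a glance whether the estimate would rest on only a few heavily reused controls.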

3.2 Subclassification, Full Matching and 
Weighting 

For settings where the outcome data is already 
available, one apparent drawback of k : 1 nearest neigh- 
bor matching is that it does not necessarily use all 
the data, in that some control individuals, even some 
of those with propensity scores in the range of the 
treatment groups' scores, are discarded and not used 
in the analysis. Weighting, full matching and sub- 
classification methods instead use all individuals. 
These methods can be thought of as giving all individuals
(either implicit or explicit) weights between
0 and 1, in contrast with nearest neighbor matching,
in which individuals essentially receive a weight of
either 0 or 1 (depending on whether or not they are
selected as a match). The three methods discussed
here represent a continuum in terms of the num- 
ber of groupings formed, with weighting as the limit 
of subclassification as the number of observations 
and subclasses go to infinity (Rubin, 2001) and full 
matching in between. 

3.2.1 Subclassification Subclassification forms 
groups of individuals who are similar, for example, 
as defined by quintiles of the propensity score distri- 
bution. It can estimate either the ATE or the ATT, 
as discussed further in Section 5. One of the first uses 
of subclassification was Cochran (1968), which ex- 
amined subclassification on a single covariate (age) 
in investigating the link between lung cancer and 
smoking. Cochran (1968) provides analytic expres- 
sions for the bias reduction possible using subclassi- 
fication on a univariate continuous covariate; using 
just five subclasses removes at least 90% of the ini- 
tial bias due to that covariate. Rosenbaum and Rubin 
(1985b) extended that to show that creating five 
propensity score subclasses removes at least 90% of
the bias in the estimated treatment effect due to 
all of the covariates that went into the propensity 
score. Based on those results, the current conven- 
tion is to use 5-10 subclasses. However, with larger 
sample sizes more subclasses (e.g., 10-20) may be 
feasible and appropriate (Lunceford and Davidian, 
2004). More work needs to be done to help determine
the optimal number of subclasses: enough to
get adequate bias reduction but not so many that
the within-subclass effect estimates become unstable.
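Forming subclasses from quantiles of the propensity score can be sketched as below. The scores are hypothetical, and computing the cut points from the pooled sample (rather than, say, the treated group only) is an assumption made for this example.

```python
import statistics

def assign_subclasses(scores, n_subclasses=5):
    """Assign each score to a subclass defined by quantiles of the
    supplied scores. Returns a list of labels 0 .. n_subclasses - 1."""
    cuts = statistics.quantiles(scores, n=n_subclasses)  # n - 1 cut points
    return [sum(s > c for c in cuts) for s in scores]

# Hypothetical estimated propensity scores for 10 individuals.
scores = [0.05, 0.12, 0.18, 0.25, 0.33, 0.41, 0.52, 0.64, 0.77, 0.90]
labels = assign_subclasses(scores)
```

Changing `n_subclasses` from 5 to 10 or 20 gives the finer groupings discussed above, at the cost of fewer individuals per subclass.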

3.2.2 Full matching A more sophisticated form of 
subclassification, full matching, selects the number 
of subclasses automatically (Rosenbaum, 1991; 
Hansen, 2004; Stuart and Green, 2008). Full match- 
ing creates a series of matched sets, where each 
matched set contains at least one treated individ- 
ual and at least one control individual (and each 
matched set may have many from either group). 
Like subclassification, full matching can estimate ei- 
ther the ATE or the ATT. Full matching is optimal 
in terms of minimizing the average of the distances 
between each treated individual and each control 
individual within each matched set. Hansen (2004) 
demonstrates the method in the context of estimat- 
ing the effect of SAT coaching. In that example the 
original treated and control groups had propensity 
score differences of 1.1 standard deviations, but the 
matched sets from full matching differed by only 
0.01 to 0.02 standard deviations. Full matching may 
thus have appeal for researchers who are reluctant to 
discard some of the control individuals but who want 
to obtain optimal balance on the propensity score. 
To achieve efficiency gains, Hansen (2004) also in- 
troduces restricted ratios of the number of treated 
individuals to the number of control individuals in 
each matched set. 

3.2.3 Weighting adjustments Propensity scores 
can also be used directly as inverse weights in es- 
timates of the ATE, known as inverse probability of 
treatment weighting (IPTW; Czajka et al., 
1992; Robins, Hernan and Brumback, 2000; 
Lunceford and Davidian, 2004). Formally, the weight
$w_i = \frac{T_i}{\hat{e}_i} + \frac{1 - T_i}{1 - \hat{e}_i}$, where $\hat{e}_k$ is the estimated propensity
score for individual k. This weighting serves to
weight both the treated and control groups up to 
the full sample, in the same way that survey sam- 
pling weights weight a sample up to a population 
(Horvitz and Thompson, 1952). 
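As an illustration, the inverse probability of treatment weights and a simple weighted difference in means can be computed as follows. The data are hypothetical, and normalizing the weights within each group (dividing by the sum of the weights) is one common choice assumed here, not the only possible estimator.

```python
def iptw_weights(T, e_hat):
    """ATE weights: 1/e for treated units, 1/(1 - e) for controls."""
    return [t / e + (1 - t) / (1 - e) for t, e in zip(T, e_hat)]

def iptw_ate(Y, T, e_hat):
    """Difference in weighted outcome means, with weights normalized
    within each treatment group."""
    w = iptw_weights(T, e_hat)
    mean_treated = (sum(wi * y for y, t, wi in zip(Y, T, w) if t == 1)
                    / sum(wi for t, wi in zip(T, w) if t == 1))
    mean_control = (sum(wi * y for y, t, wi in zip(Y, T, w) if t == 0)
                    / sum(wi for t, wi in zip(T, w) if t == 0))
    return mean_treated - mean_control

# Hypothetical treatment indicators, propensity scores and outcomes.
T = [1, 1, 0, 0]
e_hat = [0.8, 0.5, 0.5, 0.2]
Y = [3.0, 2.0, 1.0, 0.0]
w = iptw_weights(T, e_hat)
ate = iptw_ate(Y, T, e_hat)
```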



An alternative weighting technique, weighting by 
the odds, can be used to estimate the ATT 
(Hirano, Imbens and Ridder, 2003). Formally, $w_i = T_i + (1 - T_i)\frac{\hat{e}_i}{1 - \hat{e}_i}$.
With this weight, treated individuals
receive a weight of 1. Control individuals are
weighted up to the full sample using the $\frac{1}{1 - \hat{e}_i}$ term,
and then weighted to the treated group using the
$\hat{e}_i$ term. In this way both groups are weighted to
represent the treatment group.
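Weighting by the odds is equally direct to compute. The sketch below uses hypothetical treatment indicators and estimated propensity scores.

```python
def att_weights(T, e_hat):
    """Weighting by the odds: 1 for treated, e / (1 - e) for controls."""
    return [t + (1 - t) * e / (1 - e) for t, e in zip(T, e_hat)]

# Hypothetical treatment indicators and estimated propensity scores.
T = [1, 1, 0, 0, 0]
e_hat = [0.6, 0.5, 0.5, 0.2, 0.75]
w_att = att_weights(T, e_hat)
```

Controls with a high estimated probability of treatment (here the last individual) receive the largest weights, since they most resemble the treated group.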

A third weighting technique, used primarily in 
economics, is kernel weighting, which averages over 
multiple individuals in the control group for each 
treated individual, with weights defined by their dis- 
tance (Imbens, 2000). Heckman, Hidehiko and Todd 
(1997), Heckman et al. (1998) and Heckman, Ichimura 
and Todd (1998) describe a local linear matching 
estimator that requires specifying a bandwidth pa- 
rameter. Generally, larger bandwidths increase bias 
but reduce variance by putting weight on individu- 
als that are further away from the treated individ- 
ual of interest. A complication with these methods 
is this need to define a bandwidth or smoothing pa- 
rameter, which does not generally have an intuitive 
meaning; Imbens (2004) provides some guidance on 
that choice. 

A potential drawback of the weighting approaches 
is that, as with Horvitz-Thompson estimation, the 
variance can be very large if the weights are extreme 
(i.e., if the estimated propensity scores are close to
0 or 1). If the model is correctly specified and thus
the weights are correct, then the large variance is 
appropriate. However, a worry is that some of the 
extreme weights may be related more to the estima- 
tion procedure than to the true underlying proba- 
bilities. Weight trimming, which sets weights above 
some maximum to that maximum, has been pro- 
posed as one solution to this problem (Potter, 1993; 
Scharfstein, Rotnitzky and Robins, 1999). However, 
there is relatively little guidance regarding the trim- 
ming level. Because of this sensitivity to the size 
of the weights and potential model misspecification, 
more attention should be paid to the accuracy of 
propensity score estimates when the propensity 
scores will be used for weighting vs. matching 
(Kang and Schafer, 2007). Another effective strat- 
egy is doubly-robust methods (Bang and Robins, 
2005), which yield accurate effect estimates if either 
the propensity score model or the outcome model 
is correctly specified, as discussed further in Section 5.
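Weight trimming as described above amounts to capping the weights at a chosen maximum. In the sketch below the weights and the cap of 10 are hypothetical, since the literature gives little guidance on the trimming level. The effective sample size function uses Kish's approximation, which is not part of the discussion above and is included only as one familiar way to see the effect of extreme weights.

```python
def trim_weights(weights, max_weight):
    """Set any weight above max_weight to max_weight."""
    return [min(w, max_weight) for w in weights]

def effective_sample_size(weights):
    """Kish's approximation: (sum of w)^2 / (sum of w^2)."""
    return sum(weights) ** 2 / sum(w * w for w in weights)

# Hypothetical weights with one extreme value.
weights = [1.1, 1.3, 2.0, 4.0, 25.0]
trimmed = trim_weights(weights, max_weight=10.0)
ess_before = effective_sample_size(weights)
ess_after = effective_sample_size(trimmed)
```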

3.3 Assessing Common Support

One issue that comes up for all matching meth- 
ods is that of "common support." To this point, we 
have assumed that there is substantial overlap of 
the propensity score distributions in the two groups, 
but potentially density differences. However, in some 
situations there may not be complete overlap in the 
distributions. For example, many of the control indi- 
viduals may be very different from all of the treat- 
ment group members, making them inappropriate 
as points of comparison when estimating the ATT 
(Austin and Mamdani, 2006). Nearest neighbor 
matching with calipers automatically only uses in- 
dividuals in (or close to) the area of common sup- 
port. In contrast, the subclassification and weight- 
ing methods generally use all individuals, regard- 
less of the overlap of the distributions. When us- 
ing those methods it may be beneficial to explicitly 
restrict the analysis to those individuals 
in the region of common support (as in 
Heckman, Hidehiko and Todd, 1997; 
Dehejia and Wahba, 1999). 

Most analyses define common support using the 
propensity score, discarding individuals with propen- 
sity score values outside the range of the other group. 
A second method involves examining the "convex 
hull" of the covariates, identifying the multidimen- 
sional space that allows interpolation rather than 
extrapolation (King and Zeng, 2006). While these 
procedures can help identify who needs to be dis- 
carded, when many subjects are discarded it can 
help the interpretation of results if it is possible to 
define the discard rule using one or two covariates 
rather than the propensity score itself. 
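The range-based definition of common support can be sketched in a few lines. The propensity scores below are hypothetical.

```python
def common_support(treated_scores, control_scores):
    """Keep individuals whose propensity score lies within the range
    observed in the other group."""
    t_lo, t_hi = min(treated_scores), max(treated_scores)
    c_lo, c_hi = min(control_scores), max(control_scores)
    kept_treated = [s for s in treated_scores if c_lo <= s <= c_hi]
    kept_controls = [s for s in control_scores if t_lo <= s <= t_hi]
    return kept_treated, kept_controls

# Hypothetical estimated propensity scores.
treated_scores = [0.35, 0.50, 0.70, 0.92]
control_scores = [0.05, 0.10, 0.40, 0.60, 0.80]
kept_t, kept_c = common_support(treated_scores, control_scores)
```

In this example one treated individual and two controls fall outside the other group's range and would be discarded.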

It is also important to consider the implications 
of common support for the estimand of interest. Ex- 
amining the common support may indicate that it 
is not possible to reliably estimate the ATE. This 
could happen, for example, if there are controls out- 
side the range of the treated individuals and thus 
no way to estimate Y(1) for the controls without
extensive extrapolation. When estimating the ATT 
it may be fine (and in fact beneficial) to discard 
controls outside the range of the treated individu- 
als, but discarding treated individuals may change 
the group for which the results apply (Crump et al., 
2009). 

4. DIAGNOSING MATCHES 

Perhaps the most important step in using matching
methods is to diagnose the quality of the resulting
matched samples. All matching should be followed
by an assessment of the covariate balance in
the matched groups, where balance is defined as the
similarity of the empirical distributions of the full
set of covariates in the matched treated and control
groups. In other words, we would like the treatment
to be unrelated to the covariates, such that
$p(X \mid T = 1) = p(X \mid T = 0)$, where $p$ denotes the
empirical distribution. A matching method that results
in highly imbalanced samples should be rejected, 
and alternative methods should be attempted until 
a well-balanced sample is attained. In some situa- 
tions the diagnostics may indicate that the treated 
and control groups are too far apart to provide reli- 
able estimates without heroic modeling assumptions 
(e.g., Rubin, 2001; Agodini and Dynarski, 2004). In 
contrast to traditional regression models, which do 
not examine the joint distribution of the predic- 
tors (and, in particular, of treatment assignment 
and the covariates), matching methods will make 
it clear when it is not possible to separate the ef- 
fect of the treatment from other differences between 
the groups. A well-specified regression model of the 
outcome with many interactions would show this im- 
balance and may be an effective method for estimat- 
ing treatment effects (Schafer and Kang, 2008), but 
complex models like that are only rarely used. 

When assessing balance we would ideally compare 
the multidimensional histograms of the covariates in 
the matched treated and control groups. However, 
multidimensional histograms are very coarse and/or
will have many zero cells. We thus are left examin- 
ing the balance of lower-dimensional summaries of 
that joint distribution, such as the marginal distri- 
butions of each covariate. Since we are attempting to 
examine different features of the multidimensional 
distribution, though, it is helpful to do a number of 
different types of balance checks, to obtain a more 
complete picture. 

All balance metrics should be calculated in ways 
similar to how the outcome analyses will be run, as 
discussed further in Section 5. For example, if sub- 
classification was done, the balance measures should 
be calculated within each subclass and then aggre- 
gated. If weights will be used in analyses (either as 
IPTW or because of variable ratio or full matching),
they should also be used in calculating the balance 
measures (Joffe et al., 2004). 

4.1 Numerical Diagnostics 

One of the most common numerical balance diag- 
nostics is the difference in means of each covariate, 
divided by the standard deviation in the full treated
group: $\frac{\bar{X}_t - \bar{X}_c}{\sigma_t}$. This measure, sometimes referred to
as the "standardized bias" or "standardized differ- 
ence in means," is similar to an effect size and is 
compared before and after matching 
(Rosenbaum and Rubin, 1985b). The same standard 
deviation should be used in the standardization be- 
fore and after matching. The standardized difference 
of means should be computed for each covariate, 
as well as two-way interactions and squares. For 
binary covariates, either this same formula can be 
used (treating them as if they were continuous), or 
a simple difference in proportions can be calculated 
(Austin, 2009). 
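The standardized difference in means can be computed as in the sketch below. The covariate values and the set of matched controls are hypothetical; note that the standard deviation from the full treated group is reused before and after matching, as recommended above.

```python
import statistics

def standardized_difference(x_treated, x_control, sd_full_treated):
    """(mean of treated - mean of control) / SD in the full treated group."""
    diff = statistics.mean(x_treated) - statistics.mean(x_control)
    return diff / sd_full_treated

# Hypothetical values of one covariate (for example, age).
full_treated = [30.0, 34.0, 38.0, 42.0]
full_control = [20.0, 24.0, 28.0, 36.0, 40.0]
matched_control = [28.0, 36.0, 40.0]  # controls retained by some matching

sd_t = statistics.stdev(full_treated)  # same SD before and after matching
before = standardized_difference(full_treated, full_control, sd_t)
after = standardized_difference(full_treated, matched_control, sd_t)
```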

Rubin (2001) presents three balance measures 
based on the theory in Rubin and Thomas (1996) 
that provide a comprehensive view of covariate bal- 
ance: 

1. The standardized difference of means of the 
propensity score. 

2. The ratio of the variances of the propensity score 
in the treated and control groups. 

3. For each covariate, the ratio of the variance of 
the residuals orthogonal to the propensity score 
in the treated and control groups. 

Rubin (2001) illustrates these diagnostics in an ex- 
ample with 146 covariates. For regression adjust- 
ment to be trustworthy, the absolute standardized 
differences of means should be less than 0.25 and the 
variance ratios should be between 0.5 and 2 (Rubin, 
2001). These guidelines are based both on the as- 
sumptions underlying regression adjustment as well 
as on results in Rubin (1973b) and Cochran and Rubin 
(1973), which used simulations to estimate the bias 
resulting from a number of treatment effect estima- 
tion procedures when the true relationship between 
the covariates and outcome is even moderately non- 
linear. 
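The first two of these measures, together with the suggested thresholds, can be checked as in the following sketch. The propensity scores are hypothetical, the standardization by the treated group standard deviation follows the definition given earlier in this section and is an assumption here, and the third measure (residual variance ratios for each covariate) is omitted for brevity.

```python
import statistics

def rubin_checks(ps_treated, ps_control):
    """Standardized difference of means and variance ratio of the
    propensity score, compared with the ranges suggested in the text."""
    sd_t = statistics.stdev(ps_treated)
    std_diff = (statistics.mean(ps_treated) - statistics.mean(ps_control)) / sd_t
    var_ratio = statistics.variance(ps_treated) / statistics.variance(ps_control)
    return {
        "std_diff": std_diff,
        "var_ratio": var_ratio,
        "std_diff_ok": abs(std_diff) < 0.25,
        "var_ratio_ok": 0.5 <= var_ratio <= 2.0,
    }

# Hypothetical propensity scores in matched treated and control groups.
balanced = rubin_checks([0.30, 0.40, 0.50, 0.60], [0.28, 0.41, 0.49, 0.58])
unbalanced = rubin_checks([0.60, 0.70, 0.80, 0.90], [0.10, 0.20, 0.30, 0.40])
```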

Although common, hypothesis tests and p-values 
that incorporate information on the sample size (e.g., 
t-tests) should not be used as measures of
balance, for two main reasons (Austin, 2007; 
Imai, King and Stuart, 2008). First, balance is in- 
herently an in-sample property, without reference 
to any broader population or super-population. Sec- 
ond, hypothesis tests can be misleading as measures 
of balance, because they often conflate changes in 
balance with changes in statistical power. 
Imai, King and Stuart (2008) show an example where 
randomly discarding control individuals seemingly
leads to increased balance, simply because of the re- 
duced power. In particular, hypothesis tests should 
not be used as part of a stopping rule to select a 
matched sample when those samples have varying 
sizes (or effective sample sizes). Some researchers argue
that hypothesis tests are okay for testing balance
since the outcome analysis will also have reduced 
power for estimating the treatment effect (Hansen, 
2008), but that argument requires trading off Type 
I and Type II errors. The cost of those two types of 
errors may differ for balance checking and treatment 
effect estimation. 
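The way hypothesis tests conflate balance with power can be illustrated with summary statistics alone. The sketch below is an idealized example with hypothetical numbers: it assumes that randomly discarding controls leaves the group means and standard deviations unchanged, so that only the control sample size differs between the two calculations.

```python
import math

def t_statistic(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Two-sample (unequal variance) t-statistic for a difference in means."""
    return (mean_t - mean_c) / math.sqrt(sd_t**2 / n_t + sd_c**2 / n_c)

def standardized_difference(mean_t, mean_c, sd_t):
    """Difference in means divided by the treated group SD."""
    return (mean_t - mean_c) / sd_t

# Hypothetical summary statistics; only the control sample size changes.
full = t_statistic(1.0, 0.8, 1.0, 1.0, n_t=100, n_c=1000)
reduced = t_statistic(1.0, 0.8, 1.0, 1.0, n_t=100, n_c=50)
std_diff = standardized_difference(1.0, 0.8, 1.0)
```

The t-statistic falls when controls are discarded even though the standardized difference in means, and hence the actual balance, is identical in both cases.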

4.2 Graphical Diagnostics 

With many covariates it can be difficult to care- 
fully examine numeric diagnostics for each; graph- 
ical diagnostics can be helpful for getting a quick 
assessment of the covariate balance. A first step is 
to examine the distribution of the propensity scores 
in the original and matched groups; this is also use- 
ful for assessing common support. Figure 1 shows 
an example with adequate overlap of the propensity 
scores, with a good control match for each treated 
individual. For weighting or subclassification, plots
such as this can show the dots with their size pro- 
portional to their weight. 

For continuous covariates, we can also examine
quantile-quantile (QQ) plots, which compare the
empirical distributions of each variable in the treated
and control groups (this could also be done for the
variables squared or two-way interactions, getting
at second moments). QQ plots compare the quantiles
of a variable in the treatment group against the
corresponding quantiles in the control group. If the
two groups have identical empirical distributions, all
points would lie on the 45 degree line. For weighting
methods, weighted boxplots can provide similar
information (Joffe et al., 2004).

[Figure 1 plot: "Distribution of Propensity Scores"; rows labeled
Unmatched Treatment Units, Matched Treatment Units, Matched
Control Units and Unmatched Control Units; horizontal axis
"Propensity Score", with ticks from -3.0 to 0.0.]

Fig. 1. Matches chosen using 1:1 nearest neighbor matching
on propensity score. Black dots indicate matched individuals;
grey unmatched individuals. Data from Stuart and Green
(2008).

Fig. 2. Plot of standardized difference of means of 10 covariates
before and after matching. Data from Stuart and Green
(2008).

Finally, a plot of the standardized differences of 
means, as in Figure 2, gives us a quick overview of 
whether balance has improved for individual covari- 
ates (Ridgeway, McCaffrey and Morral, 2006). In 
this example the standardized difference of means 
of each covariate has decreased after matching. In 
some situations researchers may find that the stan- 
dardized difference of means of a few covariates will 
increase. This may be particularly true of covari- 
ates with small differences before matching, since 
they will not factor heavily into the propensity score 
model (since they are not predictive of treatment 
assignment). In these cases researchers should con- 
sider whether the increase in bias on those covariates 
is problematic, which it may be if those covariates 
are strongly related to the outcome, and modify the 
matching accordingly (Ho et al., 2007). One solu- 
tion for that may be to do Mahalanobis matching 
on those covariates within propensity score calipers. 



5. ANALYSIS OF THE OUTCOME 

Matching methods are not themselves methods for 
estimating causal effects. After the matching has 
created treated and control groups with adequate 
balance (and the observational study thus 
"designed"), researchers can move to the outcome 
analysis stage. This stage will generally involve re- 
gression adjustments using the matched samples, 
with the details of the analysis depending on the 
structure of the matching. A key point is that match- 
ing methods are not designed to "compete" with 
modeling adjustments such as linear regression, and, 
in fact, the two methods have been shown to work 
best in combination (Rubin, 1973b; Carpenter, 1977; 
Rubin, 1979; Robins and Rotnitzky, 1995; 
Heckman, Hidehiko and Todd, 1997; Rubin and 
Thomas, 2000; Glazerman, Levy and Myers, 2003; 
Abadie and Imbens, 2006). This is similar to the 
idea of "double robustness," and the intuition is the 
same as that behind regression adjustment in ran- 
domized experiments, where the regression adjust- 
ment is used to "clean up" small residual covariate 
imbalance between the groups. Matching methods 
should also make the treatment effect estimates less 
sensitive to particular outcome model specifications 
(Ho et al., 2007). 

The following sections describe how outcome anal- 
yses should proceed after each of the major types of 
matching methods described above. When weight- 
ing methods are used, the weights are used directly 
in regression models, for example, using weighted 
least squares. We focus on parametric modeling approaches
since those are the most commonly used;
however, nonparametric permutation-based tests,
such as Fisher's exact test, are also appropriate, as 
detailed in Rosenbaum (2002, 2010). The best re- 
sults are found when estimating marginal treatment 
effects, such as differences in means or differences 
in proportions. Greenland, Robins and Pearl (1999) 
and Austin (2007) discuss some of the challenges in 
estimating noncollapsible conditional treatment ef- 
fects and which matching methods perform best for 
those situations. 

5.1 After k:1 Matching

When each treated individual has received k 
matches, the outcome analysis proceeds using the 
matched samples, as if those samples had been gen- 
erated through randomization. There is debate about 
whether the analysis needs to account for the matched 
pair nature of the data (Austin, 2007). However,
there are at least two reasons why it is not necessary 
to account for the matched pairs (Schafer and Kang, 
2008; Stuart, 2008). First, conditioning on the vari- 
ables that were used in the matching process (such 
as through a regression model) is sufficient. Second, 
propensity score matching, in fact, does not guaran- 
tee that the individual pairs will be well-matched on 
the full set of covariates, only that groups of indi- 
viduals with similar propensity scores will have sim- 
ilar covariate distributions. Thus, it is more com- 
mon to simply pool all the matches into matched 
treated and control groups and run analyses using 
the groups as a whole, rather than using the indi- 
vidual matched pairs. In essence, researchers can do 
the exact same analysis they would have done us- 
ing the original data, but using the matched data 
instead (Ho et al., 2007).

Weights need to be incorporated into the analysis 
for matching with replacement or variable 
ratio matching (Dehejia and Wahba, 1999; 
Hill, Reiter and Zanutto, 2004). When matching 
with replacement, control group individuals receive 
a frequency weight that reflects the number of times 
they were selected as a match. When using variable
ratio matching, control group members receive
a weight that is inversely proportional to the number of
controls matched to "their" treated individual. For
example, if 1 treated individual was matched to 3
controls, each of those controls receives a weight of 1/3.
If another treated individual was matched to just 1
control, that control receives a weight of 1.
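The two weighting rules just described can be sketched as follows. The identifiers are hypothetical, and the variable ratio function assumes matching without replacement, so that each control belongs to exactly one matched set.

```python
from collections import Counter

def frequency_weights(matches):
    """Matching with replacement: weight = number of times a control
    was selected. matches maps treated id -> control id."""
    return dict(Counter(matches.values()))

def variable_ratio_weights(matched_sets):
    """Variable ratio matching: each control in a set of k controls
    matched to one treated individual gets weight 1/k.
    matched_sets maps treated id -> list of control ids."""
    weights = {}
    for controls in matched_sets.values():
        for c in controls:
            weights[c] = 1.0 / len(controls)
    return weights

# Hypothetical matched samples.
w_freq = frequency_weights({"t1": "c1", "t2": "c1", "t3": "c2"})
w_ratio = variable_ratio_weights({"t1": ["c1", "c2", "c3"], "t2": ["c4"]})
```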

5.2 After Subclassification or Full Matching 

With standard subclassification (e.g., the forma- 
tion of 5 subclasses), effects are generally estimated 
within each subclass and then aggregated across sub- 
classes (Rosenbaum and Rubin, 1984). Weighting the 
subclass estimates by the number of treated individ- 
uals in each subclass estimates the ATT; weighting 
by the overall number of individuals in each subclass 
estimates the ATE. There may be fairly substantial 
imbalance remaining in each subclass and, thus, it is 
important to do regression adjustment within each 
subclass, with the treatment indicator and covari- 
ates as predictors (Lunceford and Davidian, 2004). 
When the number of subclasses is too large — and 
the number of individuals within each subclass too 
small — to estimate separate regression models within 
each subclass, a joint model can be fit, with subclass 
and subclass by treatment indicators (fixed effects). 



This is especially useful for full matching. This es- 
timates a separate effect for each subclass, but as- 
sumes that the relationship between the covariates 
X and the outcome is constant across subclasses. 
Specifically, models such as $Y_{ij} = \beta_{0j} + \beta_{1j} T_{ij} + \gamma X_{ij} + e_{ij}$ 
are fit, where i indexes individuals and j indexes 
subclasses. In this model, $\beta_{1j}$ is the treatment 
effect for subclass j, and these effects are aggregated 
across subclasses to obtain an overall treatment 
effect: $\beta = \sum_{j=1}^{J} \frac{N_j}{N} \beta_{1j}$, where J is the number of 
subclasses, $N_j$ is the number of individuals in subclass 
j, and N is the total number of individuals. (This 
formula weights subclasses by their total size, and 
so estimates the ATE, but could be modified to es- 
timate the ATT.) This procedure is somewhat more 
complicated for noncontinuous outcomes when the 
estimand of interest, for example, an odds ratio, is 
noncollapsible. In that case the outcome proportions 
in each treatment group should be aggregated and 
then combined. 
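
The aggregation step can be illustrated with a short sketch. It assumes the subclass-specific effects have already been estimated (for example, by within-subclass regression); the function name and the example numbers are hypothetical. Weighting by total subclass size targets the ATE and weighting by the number of treated individuals targets the ATT, as described above.

```python
def aggregate_subclass_effects(effects, n_treated, n_total, estimand="ATE"):
    """Weighted average of subclass-specific treatment effects.

    ATE: weights proportional to the total size of each subclass.
    ATT: weights proportional to the number of treated in each subclass.
    """
    if estimand == "ATE":
        sizes = n_total
    elif estimand == "ATT":
        sizes = n_treated
    else:
        raise ValueError("estimand must be 'ATE' or 'ATT'")
    total = sum(sizes)
    return sum(size / total * effect for size, effect in zip(sizes, effects))


# Hypothetical estimates from three subclasses.
effects = [1.0, 2.0, 3.0]    # estimated effect within each subclass
n_treated = [5, 10, 15]      # treated individuals per subclass
n_total = [50, 30, 20]       # all individuals per subclass

ate = aggregate_subclass_effects(effects, n_treated, n_total, "ATE")
att = aggregate_subclass_effects(effects, n_treated, n_total, "ATT")
```

With these numbers the ATE weights are 0.5, 0.3 and 0.2, giving an overall estimate of about 1.7, while the ATT weights (1/6, 1/3, 1/2) give about 2.33, because the treated individuals are concentrated in the subclass with the largest effect.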

5.3 Variance Estimation 

One of the most debated topics in the literature 
on matching is variance estimation. Researchers dis- 
agree on whether uncertainty in the propensity score 
estimation or the matching procedure needs to be 
taken into account, and, if so, how. Some researchers 
(e.g., Ho et al., 2007) adopt an approach similar to 
randomized experiments, where the models are run 
conditional on the covariates, which are treated as 
fixed and exogenous. Uncertainty regarding the match- 
ing process is not taken into account. Other researchers 
argue that uncertainty in the propensity score model 
needs to be accounted for in any analysis. 
However, in fact, under fairly general conditions 
(Rubin and Thomas, 1996; Rubin and Stuart, 2006), 
using estimated rather than true propensity scores 
leads to an overestimate of variance, implying that 
not accounting for the uncertainty in using estimated 
rather than true values will be conservative in the 
sense of yielding confidence intervals that are wider 
than necessary. Robins, Mark and Newey (1992) also 
show the benefit of using estimated rather than true 
propensity scores. Analytic expressions for the bias 
and variance reduction possible for these situations 
are given in Rubin and Thomas (1992b). Specifically, 
Rubin and Thomas (1992b) state that ". . . with large 
pools of controls, matching using estimated linear 
propensity scores results in approximately half the 
variance for the difference in the matched sample 
means as in corresponding random samples for all 
covariates uncorrelated with the population discriminant." 
This finding has been confirmed in simulations 
(Rubin and Thomas, 1996) and an empirical 
example (Hill, Rubin and Thomas, 1999). Thus, 
when it is possible to obtain 100% or nearly 100% 
bias reduction by matching on true or estimated 
propensity scores, using the estimated propensity 
scores will result in more precise estimates of the 
average treatment effect. The intuition is that the 
estimated propensity score accounts for chance im- 
balances between the groups, in addition to the sys- 
tematic differences — a situation where overfitting is 
good. When researchers want to account for the un- 
certainty in the matching, a bootstrap procedure has 
been found to outperform other methods (Lechner, 
2002; Hill and Reiter, 2006). There are also some 
empirical formulas for variance estimation for par- 
ticular matching scenarios (e.g., Abadie and Imbens, 
2006, 2009b; Schafer and Kang, 2008), but this is an 
area for future research. 
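
A minimal sketch of a bootstrap that accounts for the matching step is given below. It is not the procedure of any of the cited papers: the estimator (1:1 nearest-neighbor matching with replacement on a scalar score), the function names and the toy data are all assumptions chosen to keep the example short. The key idea it illustrates is that the matching is repeated within every bootstrap replicate, so that variability due to the matching is reflected in the standard error. In a full analysis the propensity score would also be re-estimated in each replicate; here the score is taken as given.

```python
import random
import statistics


def att_by_nearest_neighbor(treated, controls):
    """1:1 nearest-neighbor matching with replacement on a scalar score.

    Each unit is a (score, outcome) pair. Returns the mean difference
    between each treated outcome and that of its closest control.
    """
    differences = []
    for score, outcome in treated:
        nearest = min(controls, key=lambda unit: abs(unit[0] - score))
        differences.append(outcome - nearest[1])
    return statistics.mean(differences)


def bootstrap_standard_error(treated, controls, n_boot=500, seed=0):
    """Resample treated and control units separately, redo the matching
    in every replicate, and return the SD of the replicate estimates."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        boot_treated = rng.choices(treated, k=len(treated))
        boot_controls = rng.choices(controls, k=len(controls))
        estimates.append(att_by_nearest_neighbor(boot_treated, boot_controls))
    return statistics.stdev(estimates)


# Toy data: outcomes rise with the score, and treatment adds 2.
controls = [(i / 10, float(i)) for i in range(1, 10)]
treated = [(i / 10, float(i) + 2.0) for i in (2, 4, 6, 8)]

estimate = att_by_nearest_neighbor(treated, controls)
standard_error = bootstrap_standard_error(treated, controls)
```

Resampling the two groups separately keeps the group sizes fixed across replicates; resampling all units jointly is an equally plausible design choice and would let the group sizes vary.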

6. DISCUSSION 

6.1 Additional Issues 

This section raises additional issues that arise when 
using any matching method, and also provides sug- 
gestions for future research. 

6.1.1 Missing covariate values. Most of the literature 
on matching and propensity scores assumes 
fully observed covariates, but of course most stud- 
ies have at least some missing data. One possibil- 
ity is to use generalized boosted models to estimate 
propensity scores, as they do not require fully ob- 
served covariates. Another recommended approach 
is to do a simple single imputation of the missing 
covariates and include missing data indicators in 
the propensity score model. This essentially matches 
based both on the observed values and on the miss- 
ing data patterns. Although this is generally not an 
appropriate strategy for dealing with missing data 
(Greenland and Finkle, 1995), it is an effective ap- 
proach in the propensity score context. Although it 
cannot balance the missing values themselves, this 
method will yield balance on the observed covari- 
ates and the missing data patterns (Rosenbaum and 
Rubin, 1984). A more flexible method is to use mul- 
tiple imputation to impute the missing covariates, 
run the matching and effect estimation separately 
within each "complete" data set, and then use the 
multiple imputation combining rules to obtain fi- 
nal effect estimates (Rubin, 1987; Song et al., 2001). 



Qu and Lipkovich (2009) illustrate this method and 
show good results for an adaptation that also in- 
cludes indicators of missing data patterns in the 
propensity score model. 
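
For the multiple imputation approach, the combining step can be sketched as follows, assuming that the matching and effect estimation have already been run within each completed data set and have produced one effect estimate and one variance estimate per data set. The function name and the example numbers are hypothetical; the formulas are the standard combining rules of Rubin (1987).

```python
import statistics


def combine_imputations(estimates, variances):
    """Combine results from m completed data sets.

    Pooled estimate: mean of the m estimates.
    Total variance:  W + (1 + 1/m) * B, where W is the mean of the
    within-imputation variances and B is the between-imputation
    variance of the estimates (divisor m - 1).
    """
    m = len(estimates)
    pooled_estimate = statistics.mean(estimates)
    within = statistics.mean(variances)
    between = statistics.variance(estimates)
    total_variance = within + (1.0 + 1.0 / m) * between
    return pooled_estimate, total_variance


# Hypothetical results from three completed data sets.
pooled, total_var = combine_imputations([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

In this example the pooled estimate is 2.0 and the total variance is 0.5 + (4/3)(1.0), about 1.83; the second term reflects the extra uncertainty due to the missing covariate values.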

In addition to development and investigation of 
matching methods that account for missing data, 
one particular area needing development is balance 
diagnostics for settings with missing covariate values, 
including diagnostics that allow for nonignorable 
missing data mechanisms. D'Agostino, Jr. and Rubin 
(2000) suggest a few simple diagnostics such as 
assessing available-case means and standard devi- 
ations of the continuous variables, and comparing 
available-case cell proportions for the categorical vari- 
ables and missing-data indicators, but diagnostics 
should be developed that explicitly consider the in- 
teractions between the missing data and treatment 
assignment mechanisms. 

6.1.2 Violation of ignorable treatment assignment 
A critique of any nonexperimental study is that there 
may be unobserved variables related to both treat- 
ment assignment and the outcome, violating the as- 
sumption of ignorable treatment assignment and bi- 
asing the treatment effect estimates. Since ignora- 
bility can never be directly tested, researchers have 
instead developed sensitivity analyses to assess its 
plausibility, and how violations of ignorability may 
affect study conclusions. One type of plausibility 
test estimates an effect on a variable that is known 
to be unrelated to the treatment, such as a pre- 
treatment measure of the outcome variable (as in 
Imbens, 2004), or the difference in outcomes be- 
tween multiple control groups (as in Rosenbaum, 
1987b). If the test indicates that the effect is not 
equal to zero, then the assumption of ignorable treat- 
ment assignment is deemed to be less plausible. 

A second approach is to perform analyses of sen- 
sitivity to an unobserved variable. Rosenbaum and 
Rubin (1983a) extend the ideas of Cornfield (1959), 
examining how strong the correlations would have 
to be between a hypothetical unobserved covariate 
and both treatment assignment and the outcome to 
make the observed treatment effect go away. Simi- 
larly, bounds can be created for the treatment effect, 
given a range of potential correlations of the unob- 
served covariate with treatment assignment and the 
outcome (Rosenbaum, 2002). Although sensitivity 
analysis methods are becoming more and more de- 
veloped, they are still used relatively 
infrequently. Newly available software 
(McCaffrey, Ridgeway and Morral, 2004; Keele, 
2009) will hopefully help facilitate their adoption 
by more researchers. 
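
One simple instance of such bounds, for matched pairs analyzed with a sign test, can be sketched as follows. The sketch follows the general approach described in Rosenbaum (2002) but is an illustration only. It parameterizes hidden bias by an odds factor rather than by correlations: gamma is the maximum factor by which an unobserved covariate could multiply the odds of treatment within a matched pair. The parameter name, the function name and the example counts are assumptions introduced here, and this is not the interface of rbounds or any other package.

```python
from math import comb


def sign_test_upper_pvalue(n_pairs, n_positive, gamma):
    """Upper bound on the one-sided sign-test p-value under hidden bias.

    n_pairs:    number of matched pairs with differing outcomes.
    n_positive: pairs in which the treated unit has the higher outcome.
    gamma:      assumed maximum odds factor for hidden bias (gamma >= 1).

    With gamma = 1 this is the usual sign test; larger values of gamma
    allow the treated unit to "win" a pair with probability as high as
    gamma / (1 + gamma) even if treatment has no effect.
    """
    if gamma < 1:
        raise ValueError("gamma must be at least 1")
    p_plus = gamma / (1.0 + gamma)
    return sum(
        comb(n_pairs, k) * p_plus ** k * (1.0 - p_plus) ** (n_pairs - k)
        for k in range(n_positive, n_pairs + 1)
    )


# Hypothetical study: the treated unit has the higher outcome in 9 of 10 pairs.
p_no_bias = sign_test_upper_pvalue(10, 9, gamma=1.0)
p_gamma_2 = sign_test_upper_pvalue(10, 9, gamma=2.0)
```

Here the p-value is about 0.011 when no hidden bias is assumed and its upper bound rises to about 0.104 when gamma is 2, so in this hypothetical study a conclusion at the 0.05 level would be sensitive to an unobserved covariate that doubled the odds of treatment.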

6.1.3 Choosing between methods. There is a wide 
variety of matching methods available, and little 
guidance to help applied researchers select between 
them (Section 6.2 makes an attempt). The primary 
advice to this point has been to select the method 
that yields the best balance (e.g., Harder, Stuart and 
Anthony, 2010; Ho et al., 2007; Rubin, 2007). But 
defining the best balance is complex, as it involves 
trading off balance on multiple covariates. Possible 
ways to choose a method include the following: (1) 
the method that yields the smallest standardized 
difference of means across the largest number of co- 
variates, (2) the method that minimizes the stan- 
dardized difference of means of a few particularly 
prognostic covariates, and (3) the method that results 
in the fewest "large" standardized 
differences of means (greater than 0.25). Another 
promising direction is work by Diamond and Sekhon 
(2006), which automates the matching procedure, 
finding the best matches according to a set of bal- 
ance measures. Further research needs to compare 
the performance of treatment effect estimates from 
methods using criteria such as those in 
Diamond and Sekhon (2006) and Harder, Stuart and 
Anthony (2010), to determine what the proper cri- 
teria should be and examine issues such as potential 
overfitting to particular measures. 
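
The balance summaries behind criteria (1) and (3) can be computed as in the sketch below. It assumes the standardized difference of means is the difference in group means divided by the standard deviation in the treated group, which is one common convention; the function names, the dictionary layout and the toy covariates are hypothetical.

```python
import statistics


def standardized_difference(treated_values, control_values):
    """Difference in means divided by the treated-group standard
    deviation (an assumed convention for this sketch)."""
    difference = statistics.mean(treated_values) - statistics.mean(control_values)
    return difference / statistics.stdev(treated_values)


def balance_summary(treated, control, threshold=0.25):
    """Absolute standardized differences for every covariate, plus the
    largest one and the number exceeding the threshold."""
    differences = {
        name: abs(standardized_difference(treated[name], control[name]))
        for name in treated
    }
    return {
        "differences": differences,
        "max_abs_difference": max(differences.values()),
        "n_large": sum(d > threshold for d in differences.values()),
    }


# Hypothetical matched samples with two covariates.
treated = {"age": [2.0, 4.0, 6.0], "income": [1.0, 2.0, 3.0]}
control = {"age": [1.0, 3.0, 5.0], "income": [1.0, 2.0, 3.0]}
summary = balance_summary(treated, control)
```

In this toy example age has a standardized difference of 0.5 and income of 0, so one covariate counts as "large" under the 0.25 threshold. Running the same summary on the output of several candidate matching methods gives a simple basis for comparing them.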

6.1.4 Multiple treatment doses. Throughout this 
discussion of matching, it has been assumed that 
there are just two groups: treated and control. How- 
ever, in many studies there are actually multiple lev- 
els of the treatment (e.g., doses of a drug). 
Rosenbaum (2002) summarizes two methods for deal- 
ing with doses of treatment. In the first method, the 
propensity score is still a scalar function of the co- 
variates (e.g., Joffe and Rosenbaum, 1999; Lu et al., 
2001). In the second method, each of the levels of 
treatment has its own propensity score (e.g., 
Rosenbaum, 1987a; Imbens, 2000) and each propen- 
sity score is used one at a time to estimate the distri- 
bution of responses that would have been observed 
if all individuals had received that dose. 

Encompassing these two approaches, 
Imai and van Dyk (2004) generalize the propensity 
score to arbitrary treatment regimes (including or- 
dinal, categorical and multidimensional). They pro- 
vide theorems for the properties of this generalized 



propensity score (the propensity function), showing 
that it has properties similar to those of the propensity 
score in that adjusting for the low-dimensional 
(not always scalar, but always low-dimensional) 
propensity function balances the covariates. They 
advocate subclassification rather than matching, and 
provide two examples as well as simulations showing 
the performance of adjustment based on the propen- 
sity function. Diagnostics are also complicated in 
this setting, as it becomes more difficult to assess 
the balance of the resulting samples when there are 
multiple treatment levels. Future work is needed to 
examine these issues. 

6.2 Guidance for Practice 

So what are the take-away points and advice re- 
garding when to use each of the many methods dis- 
cussed? While more work is needed to definitively 
answer that question, this section attempts to pull 
together the current literature to provide advice for 
researchers interested in estimating causal effects us- 
ing matching methods. The lessons can be summa- 
rized as follows: 

1. Think carefully about the set of covariates 
to include in the matching procedure, and err on 
the side of including more rather than fewer. Is the 
ignorability assumption reasonable given that set of 
covariates? If not, consider in advance whether there 
are other data sets that may be more appropriate, 
or if there are sensitivity analyses that can be done 
to strengthen the inferences. 

2. Estimate the distance measure that will be 
used in the matching. Linear propensity scores esti- 
mated using logistic regression, or propensity scores 
estimated using generalized boosted models or 
boosted CART, are good choices. If there are a few 
covariates on which particularly close balance is de- 
sired (e.g., pre-treatment measures of the outcome), 
consider using the Mahalanobis distance within 
propensity score calipers. 

3. Examine the common support and implica- 
tions for the estimand. If the ATE is of substan- 
tive interest, is there enough overlap of the treated 
and control groups' propensity scores to estimate 
the ATE? If not, could the ATT be estimated more 
reliably? If the ATT is of interest, are there controls 
across the full range of the treated group, or will it 
be difficult to estimate the effect for some treated 
individuals? 

4. Implement a matching method. 

• If estimating the ATE, good choices are generally 
IPTW or full matching. 

• If estimating the ATT and there are many more 
control than treated individuals (e.g., more than 
3 times as many), k:1 nearest neighbor matching 
without replacement is a good choice for its 
simplicity and good performance. 

• If estimating the ATT and there are not (or not 
many) more control than treated individuals, ap- 
propriate choices are generally subclassification, 
full matching and weighting by the odds. 

5. Examine the balance on covariates resulting 
from that matching method. 

• If adequate, move forward with treatment effect 
estimation, using regression adjustment on the 
matched samples. 

• If imbalance on just a few covariates, consider 
incorporating exact or Mahalanobis matching on 
those variables. 

• If imbalance on quite a few covariates, try an- 
other matching method (e.g., move to k : 1 match- 
ing with replacement) or consider changing the 
estimand or the data. 
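
The two weighting options named in step 4 can be written down directly. The sketch below assumes an estimated propensity score e strictly between 0 and 1 for each individual; the function names are hypothetical. IPTW gives treated individuals a weight of 1/e and controls a weight of 1/(1 - e), targeting the ATE, while weighting by the odds leaves treated individuals at weight 1 and gives controls a weight of e/(1 - e), targeting the ATT.

```python
def iptw_weight(is_treated, propensity):
    """ATE weights: 1/e for treated, 1/(1 - e) for controls."""
    if is_treated:
        return 1.0 / propensity
    return 1.0 / (1.0 - propensity)


def odds_weight(is_treated, propensity):
    """ATT weights: 1 for treated, e/(1 - e) for controls."""
    if is_treated:
        return 1.0
    return propensity / (1.0 - propensity)


# A treated and a control individual, both with propensity score 0.25.
treated_iptw = iptw_weight(True, 0.25)    # 1 / 0.25
control_iptw = iptw_weight(False, 0.25)   # 1 / 0.75
control_odds = odds_weight(False, 0.25)   # 0.25 / 0.75
```

Propensity scores close to 0 or 1 produce very large weights under either scheme, which is one reason to examine the common support (step 3) before weighting.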

Even if for some reason effect estimates will not 
be obtained using matching methods, it is worth- 
while to go through the steps outlined here to assess 
the adequacy of the data for answering the ques- 
tion of interest. Standard regression diagnostics will 
not warn researchers when there is insufficient over- 
lap to reliably estimate causal effects; going through 
the process of estimating propensity scores and as- 
sessing balance before and after matching can be 
invaluable in terms of helping researchers move for- 
ward with causal inference with confidence. 
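
A crude version of the overlap check in step 3 can be sketched as follows, assuming propensity scores have already been estimated. The function names and the example scores are hypothetical, and a simple range comparison is only a first look; it does not replace plots of the propensity score distributions in the two groups. The sketch also shows the linear propensity score, taken here to be the logit of the propensity score.

```python
import math


def linear_propensity(propensity):
    """Linear propensity score: the logit of the propensity score."""
    return math.log(propensity / (1.0 - propensity))


def treated_without_support(treated_scores, control_scores):
    """Treated individuals whose score lies outside the range of the
    control scores, for whom estimating the ATT may be difficult."""
    lower, upper = min(control_scores), max(control_scores)
    return [s for s in treated_scores if s < lower or s > upper]


# Hypothetical estimated propensity scores.
treated_scores = [0.30, 0.55, 0.92]
control_scores = [0.05, 0.20, 0.45, 0.70]

unsupported = treated_without_support(treated_scores, control_scores)
midpoint_logit = linear_propensity(0.5)
```

In this example the treated individual with a score of 0.92 has no control with a comparably high score, which signals that the effect for that individual would rest on extrapolation.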

Matching methods are important tools for applied 
researchers and also have many open research ques- 
tions for statistical development. This paper has pro- 
vided an overview of the current literature on match- 
ing methods, guidance for practice and a road map 
for future research. Much research has been done in 
the past 30 years on this topic; however, there are 
still a number of open areas and questions to be an- 
swered. We hope that this paper, combining results 
from a variety of disciplines, will promote awareness 
of and interest in matching methods as an important 
and interesting area for future research. 

7. SOFTWARE APPENDIX 

In previous years software limitations made it 
difficult to implement many of the more advanced 



matching methods. However, recent advances have 
made these methods more and more accessible. 
This section lists some of the major matching pro- 
cedures available. A continuously updated version 
is also available at 
http://www.biostat.jhsph.edu/~estuart/propensityscoresoftware.html. 

• Matching software for R 

- cem, http://gking.harvard.edu/cem/ 

Iacus, S. M., King, G. and Porro, G. (2009). cem: 
Coarsened exact matching software. Can also be 
implemented through MatchIt. 

- Matching, http://sekhon.berkeley.edu/matching 
Sekhon, J. S. (in press). Matching: Multivariate 
and propensity score matching with balance opti- 
mization. Forthcoming, Journal of Statistical Soft- 
ware. Uses automated procedure to select matches, 
based on univariate and multivariate balance di- 
agnostics. Primarily k : 1 matching, allows match- 
ing with or without replacement, caliper, exact. 
Includes built-in effect and variance estimation 
procedures. 

- MatchIt, http://gking.harvard.edu/matchit 

Ho, D. E., Imai, K., King, G. and Stuart, E. A. (in 
press). MatchIt: Nonparametric preprocessing for 
parametric causal inference. Forthcoming, Journal 
of Statistical Software. Two-step process: does 
matching, then user does outcome analysis. Wide 
array of estimation procedures and matching meth- 
ods available: nearest neighbor, Mahalanobis, caliper, 
exact, full, optimal, subclassification. Built-in nu- 
meric and graphical diagnostics. 

- optmatch, http://cran.r-project.org/web/packages/optmatch/index.html 

Hansen, B. B. and Fredrickson, M. (2009). optmatch: 
Functions for optimal matching. Variable 
ratio, optimal and full matching. Can also be 
implemented through MatchIt. 

- PSAgraphics, http://cran.r-project.org/web/packages/PSAgraphics/index.html 

Helmreich, J. E. and Pruzek, R. M. (2009). PSAgraphics: 
Propensity score analysis graphics. Journal 
of Statistical Software 29. Package to do graphical 
diagnostics of propensity score methods. 

- rbounds, http://cran.r-project.org/web/packages/rbounds/index.html 

Keele, L. J. (2009). rbounds: An R package for 
sensitivity analysis with matched data. Does anal- 
ysis of sensitivity to assumption of ignorable treat- 
ment assignment. 

- twang, http://cran.r-project.org/web/packages/twang/index.html 

Ridgeway, G., McCaffrey, D. and Morral, A. (2006). 
twang: Toolkit for weighting and analysis of 
nonequivalent groups. Functions for propensity 
score estimating and weighting, nonresponse weight- 
ing, and diagnosis of the weights. Primarily uses 
generalized boosted regression to estimate the 
propensity scores. 

• Matching software for Stata 

- cem, http://gking.harvard.edu/cem/ 

Iacus, S. M., King, G. and Porro, G. (2009). cem: 
Coarsened exact matching software. 

- match, http://www.economics.harvard.edu/faculty/imbens/software_imbens 

Abadie, A., Drukker, D., Herr, J. L. and Imbens, 
G. W. (2004). Implementing matching estimators 
for average treatment effects in Stata. The Stata 
Journal 4 290-311. Primarily k : 1 matching (with 
replacement). Allows estimation of ATT or ATE, 
including robust variance estimators. 

- pscore, http://www.lrz-muenchen.de/~sobecker/pscore.html 

Becker, S. and Ichino, A. (2002). Estimation of av- 
erage treatment effects based on propensity scores. 
The Stata Journal 2 358-377. Does k : 1 nearest 
neighbor matching, radius (caliper) matching and 
subclassification. 

- psmatch2, http://econpapers.repec.org/software/bocbocode/s432001.htm 

Leuven, E. and Sianesi, B. (2003). psmatch2. Stata 
module to perform full Mahalanobis and propen- 
sity score matching, common support graphing, 
and covariate imbalance testing. Allows k : 1 match- 
ing, kernel weighting, Mahalanobis matching. In- 
cludes built-in diagnostics and procedures for es- 
timating ATT or ATE. 

Note: 3 procedures for analysis of sensitivity to 
the ignorability assumption are also available: 
rbounds (for continuous outcomes), mhbounds (for 
categorical outcomes), and sensatt (to be used af- 
ter the pscore procedures). 

rbounds, http://econpapers.repec.org/software/bocbocode/s438301.htm; 

mhbounds, http://ideas.repec.org/p/diw/diwwpp/dp659.html; 

sensatt, http://ideas.repec.org/c/boc/bocode/s456747.html. 

• Matching software for SAS 



- SAS usage note: http://support.sas.com/kb/30/971.html 

- Greedy 1:1 matching, http://www2.sas.com/proceedings/sugi25/25/po/25p225.pdf 
Parsons, L. S. (2005). Using SAS software to per- 
form a case-control match on propensity score in 
an observational study. In SAS SUGI 30, Paper 
225-25. 

- gmatch macro, http://mayoresearch.mayo.edu/mayo/research/biostat/upload/gmatch.sas 
Kosanke, J. and Bergstralh, E. (2004). gmatch: 
Match 1 or more controls to cases using the 
GREEDY algorithm. 

- Proc assign, http://pubs.amstat.org/doi/abs/10.1198/106186001317114938 

Can be used to perform optimal matching. 

- 1:1 Mahalanobis matching within propensity score 
calipers, www.lexjansen.com/pharmasug/2006/publichealthresearch/pr05.pdf 
Feng, W. W., Jun, Y. and Xu, R. (2005). A method/ 
macro based on propensity score and Mahalanobis 
distance to reduce bias in treatment comparison 
in observational study. 

- vmatch macro, http://mayoresearch.mayo.edu/mayo/research/biostat/upload/vmatch.sas 
Kosanke, J. and Bergstralh, E. (2004). Match cases 
to controls using variable optimal matching. Vari- 
able ratio matching (optimal algorithm). 

- Weighting, http://www.lexjansen.com/wuss/2006/Analytics/ANL-Leslie.pdf 

Leslie, S. and Thiebaud, P. (2006). Using propen- 
sity scores to adjust for treatment selection bias. 